Preface


This volume contains articles from the 7th International Brain Lesion Workshop 
(BrainLes 2021), as well as the RSNA-ASNR-MICCAI Brain Tumor Segmentation 
(BraTS 2021) Challenge, the Federated Tumor Segmentation (FeTS 2021) Challenge, 
the Cross-Modality Domain Adaptation (CrossMoDA 2021) Challenge, and the 
challenge on Quantification of Uncertainties in Biomedical Image Quantification 
(QUBIQ 2021). All these events were held in conjunction with the Medical Image 
Computing and Computer Assisted Intervention (MICCAI) conference on September 
27, 2021, in Strasbourg, France, taking place online due to COVID-19 restrictions. 

The presented manuscripts describe the research of computational scientists 
and clinical researchers working on glioma, multiple sclerosis, cerebral stroke, 
traumatic brain injuries, vestibular schwannoma, and white matter hyper-intensities of 
presumed vascular origin. This compilation does not claim to provide a comprehensive 
understanding from all points of view; however, the authors present their latest advances 
in segmentation, disease prognosis, and other applications in the clinical context. 

The volume is divided into six chapters: the first chapter comprises invited papers
summarizing the presentations of the keynotes during the full-day BrainLes workshop 
and the FeTS challenge, the second includes the accepted paper submissions to the 
BrainLes workshop, and the third through the sixth chapters contain a selection of papers 
regarding methods presented at the RSNA-ASNR-MICCAI BraTS, FeTS, CrossMoDA, 
and QUBIQ challenges, respectively. 

The content of the first chapter with the invited papers covers the current 
state-of-the-art literature on federated learning applications for cancer research and 
clinical oncology analysis, as well as an overview of the deep learning approaches 
improving the current standard of care for brain lesions and current neuroimaging 
challenges. 

The aim of the second chapter, focusing on the accepted BrainLes workshop 
submissions, is to provide an overview of new advances of medical image analysis in all 
the aforementioned brain pathologies. It brings together researchers from the medical 
image analysis domain, neurologists, and radiologists working on at least one of these 
diseases. The aim is to consider neuroimaging biomarkers used for one disease applied 
to the other diseases. This session did not have a specific dataset to be used. 

The third chapter focuses on a selection of papers from the RSNA-ASNR- 
MICCAI BraTS 2021 challenge participants. BraTS 2021 made publicly available 
the largest ever manually annotated dataset of baseline pre-operative brain glioma 
scans from 20 international institutions in order to gauge the current state of the art 
in automated brain tumor segmentation using skull-stripped multi-parametric MRI 
sequences (provided in NIfTI file format) and to compare different methods. To pinpoint 
and evaluate the clinical relevance of tumor segmentation, BraTS 2021 also included 
the prediction of the MGMT methylation status using the same skull-stripped multi- 
parametric MRI sequences but provided in the DICOM file format to conform to the 
clinical standards (https://www.rsna.org/education/ai-resources-and-training/ai-image-challenge/brain-tumor-ai-challenge-2021).

The fourth chapter contains a selection of papers from the Federated Tumor 
Segmentation (FeTS 2021) challenge participants. This was the first computational 
challenge focussing on federated learning, and ample multi-institutional routine 
clinically-acquired pre-operative baseline multi-parametric MRI scans of radiograph- 
ically appearing glioblastoma were provided to the participants, along with splits on the 
basis of the site of acquisition. The goal of the challenge was two-fold: i) identify the 
best way to aggregate the knowledge coming from segmentation models trained on the 
individual institutions, and ii) find the best algorithm that produces robust and accurate 
brain tumor segmentations across different medical institutions, MRI scanners, image 
acquisition parameters, and populations. Interestingly, the second task was performed by 
actually circulating the containerized algorithms across different institutions, leveraging 
the collaborators of the largest real-world federation to date (www.fets.ai). 

The fifth chapter contains a selection of papers from the CrossMoDA 2021 challenge 
participants. CrossMoDA 2021 was the first large and multi-class benchmark for 
unsupervised cross-modality domain adaptation for medical image segmentation. The 
goal of the challenge was to segment two key brain structures involved in the follow-up 
and treatment planning of vestibular schwannoma (VS): the VS tumour and the cochlea. 
The training dataset provides annotated T1 scans (N = 105) and unpaired non-annotated 
T2 scans (N = 105). More information can be found on the challenge website
(https://crossmoda-challenge.ml/).

The sixth chapter contains a selection of papers from the QUBIQ 2021 challenge 
participants. QUBIQ 2021 continued the success of the first challenge on uncertainty 
quantification in medical image segmentation (QUBIQ 2020). The goal of the challenge 
was to model the uncertainty in diverse segmentation tasks in which the involved images 
include different modalities, e.g., CT and MRI scans, and varied organs and pathologies.
QUBIQ 2021 included two new 3D segmentation tasks, pancreas segmentation and 
pancreatic lesion segmentation. 

We heartily hope that this volume will promote further exciting computational 
research on brain-related pathologies.
Abstract. Machine learning has revolutionized every facet of human
life, while also becoming more accessible and ubiquitous. Its prevalence 
has had a powerful impact in healthcare, with numerous applications and 
intelligent systems achieving clinical level expertise. However, building 
robust and generalizable systems relies on training algorithms in a cen- 
tralized fashion using large, heterogeneous datasets. In medicine, these 
datasets are time consuming to annotate and difficult to collect centrally 
due to privacy concerns. Recently, Federated Learning has been proposed 
as a distributed learning technique to alleviate many of these privacy con- 
cerns by providing a decentralized training paradigm for models using 
large, distributed data. This new approach has become the de facto way of
building machine learning models in multiple industries (e.g. edge com- 
puting, smartphones). Due to its strong potential, Federated Learning is 
also becoming a popular training method in healthcare, where patient 
privacy is of paramount concern. In this paper we performed an extensive 
literature review to identify state-of-the-art Federated Learning applica- 
tions for cancer research and clinical oncology analysis. Our objective 
is to provide readers with an overview of the evolving Federated Learning
landscape, with a focus on applications and algorithms in the oncology
space. Moreover, we hope that this review will help readers to identify
potential needs and future directions for research and development. 
Highlights


• Federated learning (FL) has the potential to become the primary learning
paradigm for distributed cancer research, but specific hurdles have slowed its 
adoption in the clinical setting. 

• Labeled medical data is still extremely scarce; this problem also affects
federated learning. A plethora of cancer datasets exist (e.g., TCIA, TCGA,
Gene Expression Omnibus), but few of them are labeled for supervised
learning. The ones that are labeled (e.g., the Wisconsin Breast Cancer dataset
for classification, the BraTS dataset for image segmentation, the Kaggle
datasets for skin cancer) are the ones most commonly used in FL.

• The vast majority of papers we found use cancer datasets for benchmarking
purposes: very few federated learning works address an actual clinically relevant
question. Many of the papers we reviewed propose new software frameworks,
and virtually none follow up with a clinical trial. Based on our literature
review, this leaves FL absent from the field of clinical oncology.

• The compliance and security aspect of healthcare still poses the largest
hurdle. Commercial entities such as EHR vendors (e.g., Epic Systems, Cerner,
Meditech, Allscripts, etc.), PACS vendors (e.g., GE, Philips, Hitachi, Siemens, 
Canon, etc.), and other hardware manufacturers (e.g., Nvidia, Intel, etc.) seem 
to be the best positioned to start pulling together resources, data, and models 
that use FL to improve patient outcomes. 


1 Introduction 


Over the past decade, machine learning has witnessed rapid growth due to the 
proliferation of deep learning. Fueled by large-scale training databases [1], these 
data-driven methods have gained significant popularity. Thanks to rapidly
evolving architectures (e.g., AlexNet [2], GoogLeNet [3], ResNet [4]), convolutional
neural networks (CNNs) have demonstrated consistent improvement on difficult
computer vision tasks including classification, object detection, and segmenta- 
tion. Other areas of machine learning, such as natural language understand- 
ing, recommendation systems and speech recognition, have also seen outstand- 
ing results in their respective applications through the introduction of novel 
approaches such as transformers [5,6], DLRM [7] and RNN-T [8]. 

Such advancements in artificial intelligence and machine learning have 
already disrupted and transformed healthcare through applications ranging from 
medical image analysis to protein sequencing [9-12]. And yet, while there are 
over 150 AI-based interventions that are approved by the FDA (an updated
list with a focus on radiology can be reviewed at https://aicentral.acrdsi.org), 
many open questions persist about how to best deploy existing AI solutions in 
healthcare environments [13]. In addition to getting existing solutions deployed, 
there are many challenges that must be overcome during the training process. 
A consistent bottleneck has been the need for large amounts of heterogeneous 
data to train accurate, robust and generalizable machine learning models.
However, most healthcare organizations rarely hold data in such large quantities,
especially in the case of homogeneous populations or rare diseases with few
cases.

A common way data scientists attempt to overcome this issue is by first 
pre-training a model on large, generic datasets (e.g., ImageNet [1]), and then 
fine-tuning it on specific medical tasks of interest. However, even with this
approach, underperformance or generalizability issues [14] may persist. This is 
often the case for medical tasks where there exists a large domain shift between 
medical data (e.g., brain MRI, abdomen CT, genomics) and general purpose 
public datasets such as ImageNet [1], MIMIC-CXR [15], CheXpert [16], etc.
More recently, Self Supervised Learning (SSL) approaches have demonstrated 
promising results in performance using large unlabelled datasets, thus alleviating 
the need for annotations; however, even with such SSL approaches, access to
large amounts of heterogeneous medical data is still necessary to train
robust medical ML algorithms [17, 18].

In addition to large, heterogeneous datasets, the other most common bot- 
tleneck for ML algorithm training is computational power. The need for access 
to considerable computing resources (e.g., processing power, memory,
storage space) led to the field of distributed systems [19]. Within this area, 
distributed machine learning has evolved as a setting where algorithms are 
implemented and run on multiple nodes, leveraging larger amounts of data and 
computational resources, thus improving performance and efficiency. The core 
concept of distributed learning lies in the parallelization of algorithms across 
computational nodes [19], but these processes are run without considering any 
constraints that might need to be imposed by these nodes (e.g., considering that 
data used across these nodes comes from different distributions). In practice,
most collaborative learning applications, such as those built on user data from
mobile devices or on healthcare data from different geographic regions and
demographic groups, violate the assumption of Independent-and-Identically-Distributed
(IID) data across nodes. Federated Learning emerged as a
distributed learning paradigm that takes into account several practical challenges,
and differentiates itself from traditional distributed learning settings, as
noted by Google [20], by addressing four main themes: statistical heterogeneity
of data across nodes, data imbalance across nodes, limited communication in
the distributed network (e.g., loss of synchronization, variability of communication
capabilities), and the possibility of a large number of nodes relative to the
amount of data.

In the Federated Learning setting, a “federation” of client sites with their 
own datasets train models locally and then send their updates to a server. To
preserve privacy, the model weights are the only information passed over the lines
of communication. The model weights are then aggregated in the server from
the client updates, and the resulting aggregated model weights are sent back 
to the clients for the next round of training. Because of its strong potential to 
preserve privacy with client sites, such as hospitals, by keeping their data in- 
house, Federated Learning has seen a rise in popularity over the last several 
years, especially in the medical domain. 
Specifically, large-scale projects have been developed for facilitating collaboration
of medical institutions around the globe with the aid of Federated Learning,
in both academic and industrial areas [21]. Trustworthy Federated Data
Analytics [22], German Cancer Consortium’s Joint Imaging Platform [23], and 
the Melloddy project [24] were developed to improve academic research in various 
healthcare applications by combining multiple institutions’ efforts. In industry, 
the HealthChain project [25] aims to develop and deploy a Federated Learn- 
ing framework across four hospitals in France to help determine effective treat- 
ments for melanoma and breast cancer patients. Additionally, the Federated 
Tumour Segmentation initiative (FeTS) [26,27] is an international collaboration 
between 30 healthcare institutions aimed at enhancing tumor boundary detec- 
tion, for example, in breast and liver tumors. In another international effort [28], 
researchers trained ML models for mammogram assessment across a federation 
of US and Brazilian healthcare providers. 

In light of all these efforts, and given the growing adoption of Federated 
Learning in healthcare, we believe that the cancer research community is lacking 
a much needed review of the current state-of-the-art. Therefore, with this review 
we aim at providing a comprehensive list of Federated Learning algorithms,
applications and frameworks proposed for cancer analysis. We envision that this 
review can function as a quick reference for Federated Learning’s applications in 
cancer and oncology, and provide a motivation for research in specific directions. 


The review is structured as follows. In Sect. 2 we give an overview of Federated
Learning to introduce the reader to related concepts. The main body of
this review is found in Sect. 3, which we begin by providing the search query
along with the inclusion/exclusion criteria for papers. After this, we provide a 
summary of the current literature for: 1) Federated Learning algorithms in can- 
cer analysis, 2) Federated Learning frameworks developed for cancer research, 
and 3) Algorithms developed to preserve privacy under Federated Learning set- 
tings. Finally, we conclude this review by offering our thoughts on the needs and 
potential future directions for Federated Learning in the cancer research and 
clinical oncology space. 


2 Federated Learning Overview 


Federated Learning was first introduced as a decentralized distributed machine 
learning paradigm by Google [20]. The standard Federated Learning paradigm 
that is outlined in this paper is as follows: i) Multiple client sites, each containing 
a local dataset that remains at the client site during the entirety of training, 
connect to a global server; ii) A global model is initialized in the global server, 
and the weights from this global model are passed to each of the local client sites; 
iii) Each client site trains a local version of the global model on their respective 
dataset, and then sends the updated model weights to the global server; iv) The 
global server updates the global model by aggregating the weights it receives 
from the local clients, and then passes a copy of the updated global model to 
each of the clients. The process that occurs between steps i-iv is called a round, 
and during federated training, steps i-iv are repeated for multiple rounds until 
the global model converges to a local minimum. The most important aspect of
this process is step iii. During this step, all data used for training is kept strictly
on the local clients’ machines. The only information that is passed between the
clients and the server is the weight updates. This enables multiple sites to pool
their data for training of a global model while still maintaining data privacy.
During step iv, the authors of [20] use an algorithm that they call federated averaging
to aggregate the weights. In this algorithm, each weight update is weighted
by the size of the client dataset from which it comes, relative to the combined
size of all client datasets. The aforementioned client-server topology is known as
Centralized Federated Learning. One other topology has been found in research 
[29], Decentralized Federated Learning, in which clients communicate peer-to- 
peer without a central server. 
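
The round described above can be sketched in a few lines. This is a minimal illustration rather than any framework's implementation: it assumes each model is a flat list of floats, and `local_update` is a hypothetical stand-in for local training. Only the dataset-size weighting of step iv follows the description in the text.

```python
from typing import List, Sequence


def federated_average(client_weights: Sequence[Sequence[float]],
                      client_sizes: Sequence[int]) -> List[float]:
    """Average client weight vectors, weighting each by its share of all samples."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one dataset size per client")
    total = sum(client_sizes)
    aggregated = [0.0] * len(client_weights[0])
    for weights, size in zip(client_weights, client_sizes):
        share = size / total  # n_k / n: this client's fraction of all samples
        for i, w in enumerate(weights):
            aggregated[i] += share * w
    return aggregated


def local_update(global_weights: Sequence[float], shift: float) -> List[float]:
    """Hypothetical stand-in for local training: moves every weight by `shift`."""
    return [w + shift for w in global_weights]


def run_round(global_weights: Sequence[float],
              client_shifts: Sequence[float],
              client_sizes: Sequence[int]) -> List[float]:
    """One round: broadcast the global weights, update locally, aggregate."""
    updates = [local_update(global_weights, s) for s in client_shifts]
    return federated_average(updates, client_sizes)
```

In this sketch only weight vectors and dataset sizes reach the server; the training data never leave the clients.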

Federated Learning can be broken down into three main subtypes [30]: Hor- 
izontal Federated Learning, Vertical Federated Learning, and Federated Trans- 
fer Learning. All three of these subtypes follow the core Federated Learning 
paradigm, which is decentralized data pooling through the use of weight sharing 
and aggregation between multiple clients and a global server. They are distin- 
guished by the way in which their data sources differ. In Horizontal Federated 
Learning, every client site has different users in their data, but all of these users 
share similar features that are extracted by the networks. In Vertical Federated 
Learning, users are the same across all client sites, but each client site’s data
consists of different features, so the same user will be analyzed through different
modalities depending on the client site. In Federated Transfer Learning,
the client sites do not have users or features in common, but the tasks in their
datasets are still marginally related, so pooling them together typically leads to
more robust network training. For a more general review of Federated Learning,
readers are referred to [29,31,32]. Here we also list common Federated Learning
platforms: OpenFL [33], PySyft¹, TensorFlow Federated², FedML [34], Flower³,
NVIDIA Clara⁴, and Personal Health Train (PHT⁵).
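
The difference between the horizontal and vertical settings can be made concrete with a toy table. The sketch below is illustrative only: the patient identifiers, feature names, and values are invented, and the two helper functions are hypothetical.

```python
# Toy table: rows are users (patients), columns are features. All values are made up.
RECORDS = {
    "patient_a": {"age": 61, "tumor_size_mm": 14, "gene_x": 0.8},
    "patient_b": {"age": 47, "tumor_size_mm": 22, "gene_x": 0.1},
    "patient_c": {"age": 70, "tumor_size_mm": 9, "gene_x": 0.5},
    "patient_d": {"age": 55, "tumor_size_mm": 31, "gene_x": 0.9},
}


def horizontal_split(records, user_groups):
    """Horizontal FL: each client holds different users, same feature columns."""
    return [{u: dict(records[u]) for u in group} for group in user_groups]


def vertical_split(records, feature_groups):
    """Vertical FL: each client holds the same users, different feature columns."""
    return [
        {u: {f: row[f] for f in group} for u, row in records.items()}
        for group in feature_groups
    ]


horizontal_clients = horizontal_split(
    RECORDS, [["patient_a", "patient_b"], ["patient_c", "patient_d"]]
)
vertical_clients = vertical_split(
    RECORDS, [["age", "tumor_size_mm"], ["gene_x"]]
)
```

Here `horizontal_clients` share column names but no patients, while `vertical_clients` share patients but no columns.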


3 Review 


3.1 Search Design 


The literature review was conducted in October 2021 by searching Google 
Scholar for papers published between 2019 and 2021 that matched the query: 
federated AND (cancer OR cancers OR tumor OR tumors OR oncology). 
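
To make the boolean structure of the query explicit, it can be expressed as a simple predicate over a title or abstract. This is only a sketch of the logic: it assumes exact, case-insensitive word matching, and it does not reproduce how Google Scholar actually ranks or matches documents. The function name is illustrative.

```python
import re

CANCER_TERMS = {"cancer", "cancers", "tumor", "tumors", "oncology"}


def matches_query(text: str) -> bool:
    """True when `text` contains 'federated' AND at least one cancer term."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return "federated" in words and bool(words & CANCER_TERMS)
```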

We chose this time period for our search query because Google
did not publish their seminal Federated Learning paper [35] until 2017, so we
did not see a large number of medical applications until then. A visual representation
of the split of the material reviewed is presented in Fig. 1 and our review
process is shown in Fig. 2.


1 https://github.com/OpenMined/PySyft.

2 https://medium.com/tensorflow/introducing-tensorflow-federated-a4147aa20041.

3 https://flower.dev/.

4 https://developer.nvidia.com/clara.

5 https://pht.health-ri.nl/.
Through our review process we identified two main categories of Federated
Learning applications related to cancer and oncology: studies designed
exclusively with cancer as their intended use case, and studies in which cancer
datasets were used for benchmarking a general method (Fig. 1-Category). Every
category is also further divided into three sub-categories: the first one contains
the Federated Learning feasibility studies and methods that have been applied
to the analysis of cancer datasets (i.e., ’Framework’ in Fig. 1-Sub-Category).
The second contains Federated Learning frameworks proposed or developed for
’Cancer Analysis’, although almost all fail to secure relevant and novel cancer
datasets and hence resort to open-access data. Finally, the third sub-category
contains Federated Learning studies that address and analyze ’Privacy’ of cancer
data and computation.

[Figure 1: stacked percentage bar charts (0% to 100%) with panels Category, Sub-Category, Task, Data Type, and Cancer Type.]

Fig. 1. Split of the papers reviewed: Category and Sub-Category represent the
paper scope. Task refers to the machine learning task, while Data Type and Cancer
Type relate to the FL input data.

Fig. 2. A visual representation of our process for including papers for this review.


3.2 Federated Learning Algorithms 


Algorithms Designed for Cancer: Based on our literature search we iden- 
tified that Federated Learning has been explored in many cancer studies, where 
the aim is either comparing Federated Learning to conventional centralized data 
analysis approaches in terms of performance, or developing novel methods to 
solve various challenges faced when using Federated Learning (e.g., domain shift, 
label deficiency, ...). In the most common training scenario, researchers simulate 
a Federated Learning environment by taking an existing dataset and dividing it 
into subsets using a partitioning scheme, where each subset represents a client 
in a Federated Learning group. 
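
The simulation setup described above can be sketched as follows. This is a minimal illustration that assumes samples are referenced by integer index; the function names are hypothetical and not taken from any reviewed paper. It shows two common partitioning schemes: a shuffled (roughly IID) split and a label-sorted split that gives each simulated client a skewed class distribution.

```python
import random
from typing import List, Sequence


def iid_partition(n_samples: int, n_clients: int, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices and deal them out round-robin to the clients."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_clients] for i in range(n_clients)]


def label_skew_partition(labels: Sequence[int], n_clients: int) -> List[List[int]]:
    """Sort indices by label and cut contiguous shards, so clients see few classes."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    shard = -(-len(order) // n_clients)  # ceiling division
    return [order[i * shard:(i + 1) * shard] for i in range(n_clients)]
```

Each returned list of indices plays the role of one client's local dataset; the second scheme is one simple way to mimic the non-IID conditions discussed in the introduction.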

Federated Learning has been applied to detecting brain tumors in several
studies [36-39]. In [36], the authors used the ’Brain MRI Segmentation’ dataset
from Kaggle for low-grade glioma segmentation [40], dividing the dataset into 
5 “client” sites. The authors designed a network that achieves state-of-the-art 
results on the task of glioma segmentation, and those results remained consistent 
when they applied it to a Federated Learning setting. In [37], two separate 
Federated Learning environments for brain tumor segmentation were simulated 
using the BraTS dataset [41]. In both environments, the Federated Learning 
model was compared against two other collaborative learning techniques, and 
outperformed both. It also achieved nearly 99% of the Dice score obtained by
a model trained on the entire dataset with no decentralization. Similarly, [38] 
demonstrated comparable performance between federated averaging and data 
sharing for brain tumor segmentation on the BraTS dataset [41]. Sheller et al. 
also showed how Federated Learning improves the learning of each participating 
institution both in terms of performance on local data and performance on data 
from unseen domains. In [39], the authors presented a comparison between a 
Federated Learning approach and individual training of a 3D U-Net model to
segment glioblastoma in 165 multi-parametric structural MRI (mpMRI) scans. 
The Federated Learning approach is shown to yield superior quantitative results. 
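
For reference, the Dice score used in these comparisons measures the overlap between a predicted and a reference segmentation mask. The following is a minimal sketch, assuming binary masks flattened to sequences of 0/1 values; the helper name `dice_score` and the convention of returning 1.0 for two empty masks are choices made here for illustration.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |P and T| / (|P| + |T|) for two binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    denominator = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if denominator == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * intersection / denominator
```

A statement such as "nearly 99% of the Dice score" of a centralized model then corresponds to the ratio between the federated model's score and the centralized model's score on the same test data.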

Additional studies have explored Federated Learning on a variety of other cancers,
including less common types. The types covered in the use cases
we reviewed included skin cancer [42,43], breast cancer [44,45], prostate cancer
[46], lung cancer [47], pancreatic cancer, anal cancer, and thyroid cancer. The authors of [42]
used the ISIC 2018 dataset [48] to simulate a Federated Learning environment for
classifying skin lesions. They first partitioned the dataset among multiple mock
client sites, then used a Dual-GAN [49] to augment each client’s dataset. A classifier
was then trained in a federated environment on the augmented datasets.
In [43], the authors use the ISIC 2019 Dermoscopy dataset [48] to demonstrate
proof-of-concept for a skin lesion detection device trained using federated learning.
In Roth et al. [44], a real-world experiment of federated breast density classification
was performed using NVIDIA’s Clara framework. The authors developed
a breast density classification model with mammography data from 7 different
institutions. The global federated model showed significant improvements over the
locally trained models, both when validated on each site’s own data and in external
site validation. In [50] and [45], the authors demonstrate the ability to successfully
apply vertical federated learning (VFL) to cancer analysis, using VFL to create
a survival prediction model for breast cancer. The authors of [46] performed prostate image
segmentation in a federated setting and showed how Federated Learning improves
model performance on local datasets. Finally, [47] described a large experiment on 20K
lung cancer patients across 8 institutes and 5 countries. The authors trained a logistic
regression (LR) model on these distributed data, using the Alternating Direction
Method of Multipliers (ADMM) to fit the LR coefficients in a distributed manner. The
data included tumor staging and post-treatment survival information.

In [51], the authors tackle the task of pancreas segmentation for patients with 
pancreatic cancer. Advanced tools to correctly identify pancreatic cancer are 
extremely important since pancreatic cancer is normally only detectable once it 
is late-stage, leading to extremely low survival rates [52]. They used two datasets 
obtained from hospitals in Japan and Taiwan to simulate a Federated Learning 
environment. The resulting model was able to better identify the pancreas from
both datasets than models trained only on one site and validated on the other.
Concluding with similar results, [53] tested several deep learning architectures
for federated thyroid image classification, and Choudhury et al. [54] used data
from 3 different sites to create a prediction model for patients with anal cancer,
an extremely rare form of cancer, who received radical chemoradiotherapy. The
large and diverse group of examples given here demonstrates the robustness and
versatility of the Federated Learning paradigm, as well as its ability to improve
automated analysis of rarer cancer cases [51,53,54].

In addition to having many use cases with specific cancer types, Federated 
Learning’s applications in genomics have also been a popular focal point for 
research [55,56]. [55] performed federated gene expression analysis on breast 
cancer and skin cancer data. [56] adapted the Cox proportional hazards (PH) 
model [57] in a Federated Learning setting for survival analysis. Noting that 
adapting this method in a distributed manner is non-trivial due to its non- 
separable loss function, they implemented a discrete time extension of this model 
with a separable loss function, and validated their method on data from The Cancer
Genome Atlas (TCGA)⁶, showing comparable performance to the centralized approach.
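
The separability issue can be seen by writing out the two losses. The formulas below are the standard textbook forms, given here as an illustration; the exact discrete-time formulation used in [56] may differ. The symbols (event indicator \delta_i, risk set R(t_i), clients k = 1..K with local data \mathcal{D}_k, time intervals s, interval indicators y_{is}) are notation introduced for this sketch.

```latex
% Cox partial log-likelihood: the inner sum runs over the risk set
% R(t_i) = \{ j : t_j \ge t_i \}, which contains patients from ALL clients,
% so the loss is not a sum of independent per-client terms.
\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \Big[ \beta^\top x_i - \log \sum_{j \in R(t_i)} \exp(\beta^\top x_j) \Big]

% Discrete-time alternative: with hazard
% h_s(x_i) = \sigma(\alpha_s + \beta^\top x_i) in interval s, and y_{is} = 1
% if patient i has the event in interval s (0 otherwise), the negative
% log-likelihood is a sum over clients k of terms using only local data.
\mathcal{L}(\alpha, \beta) = - \sum_{k=1}^{K} \sum_{i \in \mathcal{D}_k} \sum_{s \le s_i}
  \Big[ y_{is} \log h_s(x_i) + (1 - y_{is}) \log\big(1 - h_s(x_i)\big) \Big]
```

Because the second loss decomposes over clients, each site can compute its own contribution and gradient locally, which is what a federated optimizer requires.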

6 https://www.cancer.gov/tcga.

While the bulk of the papers we have reviewed so far focus purely on designing
federated algorithms that can predict different aspects of cancer with high
degrees of accuracy, a large sub-group of the papers in our review also aim at
addressing challenges federated learning currently faces. For many papers, that
challenge is either data heterogeneity [58-65], a common barrier in the medical
field where patients can be subject to different geographic and demographic
conditions, or label deficiency [66,67], where it is not always guaranteed that
client sites will have access to labeled data.

Addressing label deficiency, [66] introduced a new Federated Semi-Supervised 
Learning (FSSL) approach for skin lesion classification. Their method is inspired 
by knowledge distillation [68], where they model disease relationships in each client 
by a relation matrix calculated from the local model output, then aggregate the 
relation matrices from all clients to form a global one that is used locally in each 
round to ensure that clients will have similar disease relationships. In [67], the 
authors proposed a semi-supervised Federated Learning method, FedPerl. The 
method was inspired by peer learning from educational psychology and ensemble 
averaging from committee machines and aims to gain extra knowledge by learning 
from similar clients, i.e., peers. This encouraged the self-confidence of the clients by
sharing their knowledge in a way that did not expose their identities. The experimental
setup consisted of 71,000 skin lesion images collected from 5 publicly available
datasets. With little annotated data, FedPerl outperformed state-of-the-art FSSL 
methods and the baselines by 1.8% and 15.8%, respectively. It also generalized 
better to an unseen client while being less sensitive to noisy ones. 

Another challenge that frequently occurs in Federated Learning is domain 
shift, which is caused by heterogeneity in datasets due to different scanners and 
image acquisition protocols at different sites. Many papers modify the original 
FL algorithm to account for this. Jimenez et al. [58] proposed a novel weight
aggregation algorithm to address the problem of domain shift between
data from different institutions. This study utilized one public and two private 
datasets, and the final global model outperformed previous Federated Learn- 
ing approaches. Similarly, [59] introduced a new weight aggregation strategy 
and showed its efficiency on pancreas CT image segmentation. [60] built on 
the work of [51] by developing a Federated Learning algorithm that can learn 
multiple tasks from heterogeneous datasets, making use of a training paradigm 
the authors call dynamic weight averaging (DWA). Specifically, they trained 
a model on the binary-classification problem of segmenting the pancreas from 
background as well as the multi-label classification problem of segmenting healthy
and tumorous pancreatic tissue and background. During the global aggregation 
step, the weight value for each client update was adjusted based on the variation 
of loss values from the previous rounds. DWA outperforms federated averag- 
ing (FedAvg) and FedProx [69], another federated weight aggregation scheme 
designed to handle heterogeneous networks. 
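
As background for this comparison, FedProx [69] is commonly described as adding a proximal term to each client's local objective, which penalizes local weights for drifting away from the current global weights. The sketch below illustrates a single local gradient step under that description; it assumes flat lists of floats, and the function name and default hyperparameters are hypothetical. It does not reproduce DWA, whose exact weighting rule is not given here.

```python
from typing import List, Sequence


def fedprox_local_step(weights: Sequence[float],
                       global_weights: Sequence[float],
                       grad: Sequence[float],
                       lr: float = 0.1,
                       mu: float = 0.01) -> List[float]:
    """One gradient step on F_k(w) + (mu / 2) * ||w - w_global||^2.

    `grad` is the gradient of the client's own loss F_k at `weights`;
    the proximal term contributes mu * (w - w_global) to the gradient.
    """
    return [
        w - lr * (g + mu * (w - wg))
        for w, wg, g in zip(weights, global_weights, grad)
    ]
```

With `mu = 0` the step reduces to plain local gradient descent as used with federated averaging; larger values of `mu` keep heterogeneous clients closer to the shared model.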

In Guo et al. [61], the authors addressed the problem of domain shift while 
applying their algorithm to the task of MRI reconstruction, using 4 different 
MRI datasets: FastMRI, BraTS, IXI, and HPKs. Their algorithm, Federated
Learning-based Magnetic Resonance Imaging Reconstruction with Cross-site 
Modeling (FL-MRCM), uses an adversarial domain identifier to align latent fea- 
tures taken from the encoders of 2 different sites, avoiding sharing of data while 
taking advantage of multiple sites’ data. In all experiments, FL-MRCM came 
closest to reaching the upper-bound score of training a network on the entire 
dataset. In the same space, to alleviate the performance impact of domain shift, [62]
proposed a new method to train deep learning algorithms in Federated Learning 
settings based on the disentanglement of the latent space into shape and appear- 
ance information. Their method only shared the shape parameters to mitigate 
domain shifts between individual clients. They presented promising results on 
multiple brain MRI datasets. 

Researchers in [63] proposed a method to address domain shift issues in terms 
of performance and stability based on sharing the parameters of batch normal- 
ization across clients but keeping the batch norm statistics local. Given that 
these statistics are not shared with the central server they argued that there 
is better protection from privacy attacks. They demonstrated their algorithm 
on breast histopathology image analysis (Camelyon 2016 and Camelyon 2017
datasets). In [64] a key problem of digital pathology is addressed via federated
learning: stain normalization across multiple laboratories and sites. They apply 
GANs in a Federated Learning environment to solve the problem of color nor- 
malization that arises due to different staining techniques used at different sites. 
Here, a central discriminator is trained to be extremely robust by making use of 
several decentralized generators. 
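The batch normalization scheme attributed to [63] above, in which learned parameters are shared while running statistics stay at each site, can be sketched as follows. This is a simplified illustration and not the authors' implementation: model states are plain dictionaries of float lists, and the key suffixes `running_mean` and `running_var` are an assumed naming convention (they mirror common deep learning libraries).

```python
def aggregate_shared_parameters(client_states,
                                local_keys=("running_mean", "running_var")):
    """Average every entry except batch norm statistics.

    client_states: list of dicts mapping a parameter name to a list of floats.
    Names ending in one of `local_keys` are treated as batch norm statistics
    and are left out of the global state, so they never leave the client.
    """
    shared = {}
    for name in client_states[0]:
        if name.endswith(local_keys):
            continue  # statistics stay local
        columns = zip(*(state[name] for state in client_states))
        shared[name] = [sum(col) / len(client_states) for col in columns]
    return shared


def load_global_state(local_state, global_state):
    """Overwrite shared parameters, keep the client's own statistics."""
    updated = dict(local_state)
    updated.update(global_state)
    return updated


client_a = {"bn.weight": [1.0, 1.0], "bn.bias": [0.0, 0.2],
            "bn.running_mean": [0.5, 0.5], "bn.running_var": [1.0, 1.0]}
client_b = {"bn.weight": [3.0, 1.0], "bn.bias": [0.2, 0.0],
            "bn.running_mean": [-0.5, 2.0], "bn.running_var": [4.0, 0.5]}

global_state = aggregate_shared_parameters([client_a, client_b])
client_a_next = load_global_state(client_a, global_state)
```

Because the statistics are excluded before aggregation, the server never observes them, which is the property the authors argue improves protection against privacy attacks.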

Domain shift in Federated Learning has also been studied in Neural Architecture
Search (NAS). [65] applied AutoML, a NAS approach, in a federated
setting for prostate image segmentation. To address domain shift, they trained
a “supernet” consisting of several deep learning modules in a federated setting,
then personalized this supernet in each client by searching for the best path
through the supernet components for that client.


General Algorithms Benchmarked on Cancer Datasets: Cancer datasets 
are also commonly used as benchmarks for evaluating general Federated Learning 
approaches. BraTS [41], HAM10000 [70], Wisconsin Breast Cancer dataset [71], 
and TCGA were the most common datasets used in the papers we sourced for
this review. 

The BraTS dataset is an imaging dataset used to train computer vision mod- 
els for brain tumor segmentation. It is frequently used as a benchmark for state- 
of-the-art image analysis algorithms. Chang et al. [72] performed a Federated 
Learning experiment on BraTS [41] using GANs in a similar setting to [64]. They 
use several decentralized discriminators, placed at mock client sites, to train a 
centralized discriminator at the client. Receiving synthetic images from a large
number of generators allowed the authors to augment the dataset in a decentralized
fashion and train the discriminator to achieve very high accuracy. In
some cases the classifier was able to outperform non-Federated Learning trained 
models, using Area Under the Curve (AUC) as a performance metric. In [73], 
the authors address the problem of domain shift while benchmarking on BraTS. 
They partition the network, and place a copy of each partition at each client 
site. They then place the rest of the network on a centralized server. Lower-level 
features taken from each client site are aggregated and passed as input to the 
central network, which learns to be robust against domain shift. This paradigm
leads to extremely strong training results, especially as the domain shift becomes 
more pronounced. 

The HAM10000 dataset is a multi-source dermatoscopic image dataset of 
pigmented lesions used for skin lesion detection and segmentation. Similar to
BraTS, it frequently appears in many computer vision applications, such as [74], 
where the authors proposed a new server aggregation method addressing sta- 
tistical heterogeneity that may be present between the participating datasets. 
The weights are calculated to be inversely proportional to the difference between 
the corresponding client model distribution and the global model distribution. 
They validated their new method on several benchmarks, including HAM10000 
[70]. In [75] a new Federated Learning strategy was introduced for tackling
non-IID data. Training one epoch on each local dataset was done over several
communication rounds. The approach was evaluated on various datasets,
including HAM10000, and showed superior results to similar methods, such as
FedAvg.
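The aggregation rule described for [74], where a client's weight is inversely proportional to how far its model lies from the global model, can be illustrated with a small sketch. The text refers to a difference between model distributions without giving the measure; the Euclidean distance between parameter vectors used here is a stand-in assumption, and the function name and `eps` parameter are hypothetical.

```python
import math


def inverse_distance_weights(client_params, global_params, eps=1e-8):
    """Weights inversely proportional to each client's distance from the
    global model, normalized to sum to 1.

    Assumption: Euclidean distance between parameter vectors stands in for
    the distribution difference used in the original method.
    """
    distances = [math.dist(params, global_params) for params in client_params]
    inverse = [1.0 / (d + eps) for d in distances]  # eps avoids division by zero
    total = sum(inverse)
    return [value / total for value in inverse]


# Client 0 is at distance 1 from the global model, client 1 at distance 3.
weights = inverse_distance_weights([[1.0, 0.0], [0.0, 3.0]], [0.0, 0.0])
```

In this toy case the closer client receives roughly three times the weight of the more distant one, which damps the influence of statistically atypical clients on the global model.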

The Wisconsin Breast Cancer dataset [71] is another versatile dataset that 
is used for benchmarking many different classification algorithms. It is a simple 
dataset that is easy to integrate into most ML workflows, consisting of positive 
and negative breast cancer samples, and several numerical features describing 
those samples. Salmeron et al. [76] used this dataset to simulate a Federated 
Learning environment. The authors then used this environment to train a Fuzzy 
Cognitive Map (FCM) [77] classifier that outperformed clients that were trained 
individually as well as a model trained on the entire dataset. Researchers in 
[78] extended SQL-based training data debugging (RAIN method) for Federated 
Learning. They demonstrated this extension on multiple datasets, including the 
Wisconsin Breast Cancer dataset [71]. [79] introduced a new Federated Learning 
strategy that showed comparable performance to federated averaging while giv- 
ing two benefits: communication efficiency and trustworthiness, via Stein Varia- 
tional Gradient Descent (SVGD) which is a non-parametric Bayesian framework 
that approximates a target posterior distribution via non-random and interact- 
ing particles. They performed extensive experiments on various benchmarks, 
including binary classification of breast cancer data. [80] introduced a new federated
setup that incurs lower communication costs and requires no centralized model
sharing; clients learn collaboratively and simultaneously without the need for
synchronization. They validated their setup, termed gradient assisted learning, on
various datasets including breast cancer, and showed comparable performance
with state-of-the-art methods but with lower communication costs. [81] investigated
how to mitigate the effects of model poisoning, a scenario where one or
more clients upload intentionally false model parameters (or are forced to do 
so, e.g. by being hacked). They introduced new model-poisoning attacks, and 
showed that the methods of mitigating the effects of these attacks still need 
development. In [82], a method for building a global model under the Federated 
Learning setting was proposed by learning the data distribution of each client 
and building a global model based on these shared distributions. 

The Cancer Genome Atlas (TCGA) is a public consortium of cancer data created
for the purpose of benchmarking healthcare analysis algorithms. In [83] a
method was proposed for matrix factorization under Federated Learning settings. 
Specifically, they extended the FedAvg method to allow for robust matrix factor- 
ization. They benchmarked this method on the Cancer Genome Atlas (TCGA). 
Benchmarking on the same data, [84] introduced two Federated Learning algo- 
rithms for matrix factorization and applied them to a data clustering task. 


3.3 Federated Learning Frameworks 


Frameworks Developed for Cancer Analysis: In [85], the authors designed 
a decentralized framework which they coined BrainTorrent. This framework
removes the global server from the traditional FL paradigm, and instead allows
sites to communicate their weights with one another directly. The framework was
tested on the task of whole-brain segmentation and demonstrated strong
results, outperforming traditional Federated Learning with a global server and
achieving performance close to that of a model trained using pooled data. [86]
designed an open source framework to facilitate analysis of local data between 
institutions in order to create a model for oral cavity cancer survival rates using 
data from multinational institutions. [87] introduced a framework, GenoPPML, 
that is a combination of Federated Learning and multiparty computation. The 
framework utilizes differential privacy and homomorphic encryption for guaran- 
teeing preserved privacy, and it was mainly built for regression for genomics data. 
In [88] the authors proposed a framework to train on skin lesion images using 
IoT devices (smartphones). They further utilized Transfer Learning in this Federated
Learning framework to circumvent the need for large labelled datasets. The
German National Cancer Center, an initiative whose primary goal is to foster 
multiclinical trials for development of improved diagnosis and treatment tools 
for cancer, recently released the Joint Imaging Platform (JIP) [89], a platform 
designed to build a foundation for Federated Learning scenarios. JIP provides 
containerized tools for Federated Learning, and many institutions have com- 
mitted to testing JIP for use cases in the coming years. [90] provides another 
framework with multiple objectives and use cases. Here, the authors proposed a 
“marketplace” approach to federated learning: it provides the infrastructure and 
other computational resources for 3rd party applications to run in a Secure Multiparty
Computation system; there, for example, multiple computational
tasks related to cancer research (from data normalization to Kaplan-Meier analysis
and Cox regression) are treated as “Apps” and deployed into a secure and
distributed environment. 


General Frameworks: Because decentralized analysis of medical data is one of 
the most natural use cases for federated learning, cancer datasets are frequently 
included when benchmarking general federated learning frameworks. [91] introduced
a framework for federated meta learning: a library for fast and efficient
algorithm selection. They evaluated a prototype on various datasets, including
a breast cancer dataset, showing that their framework finds the
best algorithm for a given dataset more efficiently than ordinary grid search. In
[92], the authors design a classification framework for breast cancer that incorporates
differential privacy. Similarly, [50] uses the Wisconsin Breast Cancer dataset as
one of their use cases for a privacy-verification FL framework.


3.4 Privacy Protection in Federated Learning Settings 


One important benefit of Federated Learning for healthcare is its potential to 
mitigate privacy concerns. Although Federated Learning allows for multiple sites 
to train ML models on their data safely, there are still ways that this paradigm 
can be exploited. One very common exploitation is that dataset labels can be 
reconstructed from the gradients used during model training [93,94]. 
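A small worked example makes the label leakage concrete. It assumes a classifier trained with softmax cross-entropy and a gradient computed on a single sample, in which case the gradient with respect to the last-layer bias equals the predicted probabilities minus the one-hot label, so the only negative entry marks the true class. The function names are hypothetical, and the attacks in [93,94] may rely on different or more general mechanisms.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def last_layer_bias_gradient(logits, label):
    """Gradient of softmax cross-entropy w.r.t. the output bias: p - y."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


def infer_label(bias_gradient):
    """The true class is the only entry with a negative gradient."""
    return min(range(len(bias_gradient)), key=bias_gradient.__getitem__)


# A client shares this gradient; the private label (class 2) is recoverable.
shared_gradient = last_layer_bias_gradient([2.0, -1.0, 0.3], label=2)
recovered = infer_label(shared_gradient)
```

The recovery works even when the model's prediction is wrong, since the sign pattern depends only on the label and not on which class has the highest probability.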

In this section we discuss research that addresses privacy concerns of Feder- 
ated Learning in cancer. We present papers that either benchmark their privacy- 
concerned investigations and methods on cancer data, or those which study Fed- 
erated Learning privacy exclusively for cancer applications. 


Privacy Methods for Cancer: In [95], the authors proposed a combination 
of meta-heuristic methods to operate the whole mechanism of aggregation, sepa- 
ration of models as well as evaluation. They analyzed the results in terms of the 
accuracy of the general model as well as for security against poisoning attacks. 
[96] implemented differentially private SGD training in a cyclic Federated Learning
setting of two clients, and did an extensive study on the trade-off between
privacy and accuracy. They achieved an acceptable trade-off between accuracy
and privacy, and evaluated their approach on classification of tumorous genes.
In [97] the authors benchmarked various differential privacy methods against 
skin lesion classification in Federated Learning settings. [98] demonstrated an 
approach to prevent access to intermediate model weights by using a layer for 
privacy protection. The aggregation server prevented direct connections between 
hosts so that interim model weights cannot be viewed during training. 
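For readers unfamiliar with differentially private SGD, the following is a minimal sketch of one update step in its commonly used form: each per-sample gradient is clipped to a maximum norm, the clipped gradients are summed, Gaussian noise scaled to the clipping norm is added, and the result is averaged. Parameter names such as `clip_norm` and `noise_multiplier` are illustrative, and the configuration used in [96] is not specified in the text above.

```python
import math
import random


def dp_sgd_step(weights, per_sample_grads, lr=0.1, clip_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One differentially private SGD step on a flat parameter vector."""
    rng = rng or random.Random()
    batch_size = len(per_sample_grads)
    summed = [0.0] * len(weights)
    for grad in per_sample_grads:
        # Clip each per-sample gradient so its L2 norm is at most clip_norm.
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for i, g in enumerate(grad):
            summed[i] += g * scale
    # Add Gaussian noise calibrated to the clipping norm, then average.
    noisy_mean = [
        (s + rng.gauss(0.0, noise_multiplier * clip_norm)) / batch_size
        for s in summed
    ]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


grads = [[3.0, 4.0], [0.6, 0.8]]  # L2 norms 5 and 1
noiseless = dp_sgd_step([0.0, 0.0], grads, noise_multiplier=0.0)
private = dp_sgd_step([0.0, 0.0], grads, noise_multiplier=1.0,
                      rng=random.Random(0))
```

Raising `noise_multiplier` strengthens the privacy guarantee at the cost of noisier updates, which is the privacy-accuracy trade-off studied in this line of work.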

In [99], the authors studied the effect that two different techniques to preserve 
privacy had on a Federated Learning environment: injecting samples with noise 
or sharing only a fraction of the model’s weights. Using the BraTS dataset [41] for 
brain tumor segmentation, they found that leaving out up to 40% of the model’s 
weights only affected accuracy by a negligible amount. Using the BraTS dataset 
[41] the authors in [100] extended Private Aggregation of Teacher Ensembles 
(PATE) [101] which is used as an aggregation function using the teacher-student 
paradigm to enable privacy preserving training: teacher models are trained on 
private datasets and the student model (global) is trained on a public dataset 
using those teacher models. This extension applied a dimensionality reduction 
method to increase sensitivity for segmentation tasks. They validated their approach
on three common dimensionality reduction methods to assess differential
privacy: PCA, Autoencoder and Wavelet transforms. [102] used noise injection
as a successful privacy preservation technique for analyzing gigapixel whole
slide images. [103] created a hybrid environment for encryption of medical data 
using blockchain technologies, Federated Learning, and homomorphic encryp- 
tion. Homomorphic encryption is also used in [104], where it is leveraged to show 
secure and accurate computation of essential biomedical analysis tasks, including
Kaplan-Meier survival analysis in oncology and genome-wide association studies 
(GWAS) in medical genetics. The authors demonstrate this through the use of 
their framework, FAMHE. GWAS data was also at the center of the SAFETY 
framework [105], where a hybrid deployment of both homomorphic encryption 
and secure hardware (Intel SGX) provides a good trade-off in terms of efficiency 
and computational support for secure statistical analysis. Rajotte et al. [106]
developed a framework called FELICIA (Federated Learning with a Centralized 
Adversary), which uses the PrivGAN architecture [107] to make use of data from 
multiple institutions and create higher-quality synthetic training data without 
sharing data among sites. [108] used differential privacy and demonstrated how 
the performance was still comparable to the centralized experiments despite the 
privacy-performance trade-off. They also showed empirically how the model with 
differential privacy became immune against adversarial attacks, and evaluated 
all their approaches on liver image segmentation. 
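The teacher-student aggregation underlying PATE [101] can be illustrated for the simple case of classification labels. In this sketch, each teacher votes for a class, Laplace noise is added to the vote counts, and the student is given only the noisy winner. It does not cover the segmentation extension with dimensionality reduction proposed in [100], and the function names and `noise_scale` parameter are hypothetical.

```python
import random


def sample_laplace(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_teacher_label(teacher_votes, num_classes, noise_scale, rng):
    """Return the class with the highest noisy vote count."""
    counts = [0] * num_classes
    for vote in teacher_votes:
        counts[vote] += 1
    noisy = [count + sample_laplace(noise_scale, rng) for count in counts]
    return max(range(num_classes), key=noisy.__getitem__)


rng = random.Random(0)
# 90 of 100 teachers, each trained on a private dataset, vote for class 1.
votes = [1] * 90 + [0] * 10
student_label = noisy_teacher_label(votes, num_classes=2,
                                    noise_scale=0.5, rng=rng)
```

When the teachers agree strongly, the noise rarely changes the outcome, so the student receives useful labels while no single private dataset determines the answer.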


General Privacy-Preserving Methods Benchmarked on Cancer Datasets:
[109] introduced Federboost, a Federated Learning method for gradient
boosting decision trees (GBDT). Their method can be applied for vertical and
horizontal Federated Learning, and is characterized by the ease of ensuring secure 
model sharing. They demonstrated security and comparable performance to cen- 
tralized settings using various datasets including breast cancer gene data from 
TCGA. [110] introduced a new Federated Learning approach for mitigating pos- 
sible privacy breaches when sharing model weights. Their method was evaluated 
on various benchmark datasets including breast cancer data, and showed com- 
parable performance to the conventional Federated Learning approaches while 
being more robust to gradient leaks, i.e. more privacy-preserving. [111] devel- 
oped a homomorphic encryption framework on FPGA, aiming to accelerate the 
training phase under Federated Learning with the most possible encryption. 
They demonstrated performance improvement in speed benchmarking on mul- 
tiple datasets including the Wisconsin Breast Cancer dataset. 

In [112], the authors proposed attacks for two machine learning algorithms, 
logistic regression and XGBoost, in a Federated Learning setting. In this study 
the adversary does not deviate from the defined learning protocol, but attempts 
to infer private training data from the legitimately received information. In [113], 
the authors proposed an approach, self-taught Federated Learning, to address 
the limitations of current methods when handling heterogeneous datasets (e.g. a 
slow training speed, impractical for real-world applications). It exploited unsu- 
pervised feature extraction techniques for Federated Learning with heteroge- 
neous datasets while preserving data privacy. In [114] a method is proposed to 
identify malicious poisoning attacks by having the server itself bootstrap trust. 
Specifically, the server collects a small, clean training dataset (called the root 
dataset) for the learning task and maintains a model (called server model) based 
on this to bootstrap trust. In each iteration, the server first assigns a trust score 
to each local model update from the clients, where a local model update has
a lower trust score if its direction deviates more from the direction of the server
model update. They benchmarked their method against CH-MNIST, a
medical image classification dataset consisting of 5,000 images of histology tiles 
collected from colorectal cancer patients. Where privacy is concerned, quantum 
cryptography is probably the next frontier of the security battleground, and some 
authors have started developing in this direction while using cancer datasets for 
benchmarking their secure federated learning frameworks [115]. Figure 1 presents 
an overall synopsis of all the studies reviewed in this paper based on AI tasks, 
cancer type, data type and category of work. 
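The trust-bootstrapping defense described for [114] can be sketched under the assumption that the trust score is the cosine similarity between a client update and the server's own update (computed on its root dataset), clipped at zero, and that client updates are rescaled to the server update's magnitude before a trust-weighted average. Function names are hypothetical and the sketch operates on flat lists of floats rather than real model tensors.

```python
import math


def l2_norm(vector):
    return math.sqrt(sum(x * x for x in vector))


def trust_score(client_update, server_update):
    """Cosine similarity with the server update, clipped at zero."""
    denom = l2_norm(client_update) * l2_norm(server_update)
    if denom == 0.0:
        return 0.0
    cosine = sum(c * s for c, s in zip(client_update, server_update)) / denom
    return max(0.0, cosine)


def trusted_aggregate(client_updates, server_update):
    """Trust-weighted average of magnitude-normalized client updates."""
    scores = [trust_score(update, server_update) for update in client_updates]
    total = sum(scores)
    result = [0.0] * len(server_update)
    if total == 0.0:
        return result  # no client is trusted in this round
    server_norm = l2_norm(server_update)
    for score, update in zip(scores, client_updates):
        if score == 0.0:
            continue
        # Rescale so a client cannot dominate by sending a huge update.
        scale = score * server_norm / l2_norm(update)
        for i, value in enumerate(update):
            result[i] += scale * value / total
    return result


server_update = [1.0, 0.0]
honest, poisoned = [2.0, 0.0], [-5.0, 0.0]
global_update = trusted_aggregate([honest, poisoned], server_update)
```

In this toy round the poisoned update points opposite to the server update, receives a trust score of zero, and therefore has no influence on the aggregated result.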


4 Conclusion and Discussion 


Data decentralization is a crucial setting for developing data-driven models in 
healthcare due to the sensitive nature of medical data. Federated Learning, while 
still a new research field, has already demonstrated its potential use to support 
a distributed learning setup for healthcare. While the general field of Federated 
Learning research is very active with a focus on improving model aggregation 
and efficient communication between nodes, model and data privacy is a very 
challenging and open problem [32]. The data privacy aspect is very important 
especially in healthcare where legal, ethical and regulatory constraints impose
tremendous restrictions and pressure on data providers (e.g., healthcare networks,
research institutions).

While the Federated Learning research community is engaged in addressing 
the aforementioned open problems, in this paper we aimed at presenting the cur- 
rent status of Federated Learning in the domain of cancer and oncology because 
we believe that the machine learning community in this particular space can 
benefit from a quick review and perhaps direct research efforts in specific subareas.
Our review highlighted that although many works have been developed for
Federated Learning, only 56% of them have been exclusively proposed for cancer
research or clinical oncology. This demonstrates the need for solutions designed 
specifically within this space. For example, privacy preserving methods may need 
to be researched and explored under the scope of the cancer field given that pri- 
vacy requirements and guarantees can be significantly different from other areas 
(e.g., finance). In a similar fashion, while data heterogeneity is an open chal- 
lenge in the general machine learning community, cancer and oncology datasets 
manifest unique properties which may require deeper clinical and medical device 
expertise involvement when developing methods that aim at overcoming model 
degradation in largely heterogeneous medical data.

Although there are quite a few frameworks developed specifically for cancer 
analysis (i.e., 13%; Fig. 1), there is the potential risk of a fragmented platform
landscape. This is true when it comes to the general Federated Learning commu- 
nity in which a large number of frameworks are currently being developed and 
maintained. Indeed, such efforts can lead to improved solutions but it is usually 
collaborative efforts that can achieve better adoption. In the cancer domain data 
scientists can benefit from platforms that aim at developing tools for distributed 
annotation, distributed model training workflows, and moreover the adoption of 
data standardization and thus better integration of Federated Learning into the
clinical workflow. 

When it comes to tasks (Fig. 1) we observed that the majority of algorithms 
are related to classification and segmentation, and use images (either from radi- 
ology or pathology) as input data type. This highlights the need for a broader 
exploration of other important tasks in cancer analysis such as survival predic- 
tion, genomics expression, precision medicine, patient treatment planning, and 
advanced patient diagnosis/prognosis through multi-modal data. Furthermore, 
within the context of cancer type we identified that almost 70% of the stud- 
ies were addressing only a specific type of cancer: either brain tumor, or breast 
cancer, or skin lesions. This reaffirms our previous statement that Federated 
Learning should expand its application on multiple cancer types. Perhaps the 
reason for this increased focus on these three specific cancer types comes from 
the fact that these three areas have been well-established through the release of 
large public datasets. This emphasizes the overall need for large medical datasets 
being available to the research community. Ideally, federations that are currently 
being developed to support distributed learning (e.g., Federated Learning) will 
provide support in the future for secure remote machine learning development 
on geographically distributed data providers through robust privacy-preserving 
layers. 

As with any new research field, Federated Learning for healthcare and in 
particular for cancer and oncology is still in its early days. However, whether the 
studies were simulating Federated Learning environments or conducting small 
experiments across hospitals with real private data, they constitute a solid basis
for future work. Federated Learning infrastructures are continuously being devel- 
oped specifically for healthcare and cancer research to facilitate true collabora- 
tion between healthcare institutions across the world. 


Abstract. In recent years, deep learning techniques have shown potential
for incorporation in many facets of the medical imaging pipeline,
from image acquisition/reconstruction to segmentation/classification to 
outcome prediction. Specifically, these models can help improve the effi- 
ciency and accuracy of image interpretation and quantification. However, 
it is important to note the challenges of working with medical imaging 
data, and how this can affect the effectiveness of the algorithms when 
deployed. In this review, we first present an overview of the medical 
imaging pipeline and some of the areas where deep learning has been 
used to improve upon the current standard of care for brain lesions. We 
conclude with a section on some of the current challenges and hurdles 
facing neuroimaging researchers. 


Keywords: Deep learning · Imaging · Neuro-oncology


1 Introduction 


The advent of noninvasive imaging technologies such as magnetic resonance 
imaging (MRI) and computed tomography (CT) has revolutionized medicine, 
enabling clinicians to make informed decisions for diagnosis, surgical planning, 
and treatment response assessment. In recent years, access to larger and more 
comprehensive repositories of patient imaging data along with advances in com- 
putational resources has closed the gap between machine and human. Specif- 
ically, artificial intelligence (AI) based algorithms can now interpret imaging 
scans at the level of expert clinicians. 

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

While the majority of current research is focused on the interpretation of
medical imaging, upstream aspects of the imaging pipeline are primed to be
improved via AI as well. Briefly, the imaging pipeline can be broken into three
steps: 1) acquisition/reconstruction, 2) analysis, and 3) interpretation (Fig. 1). 
The first step in the pipeline is image acquisition, wherein raw data that is 
not visually interpretable by a human is gathered. This raw data must then be 
reconstructed into an anatomical image. For example, when performing an MRI, 
data is acquired at specific frequency bands in the Fourier domain and is then 
reconstructed into the spatial domain for human interpretation. The next step is 
image analysis, wherein both qualitative and quantitative information regarding 
the pathology of interest is gleaned. Finally, the last step is image interpretation, 
wherein a trained clinician makes judgments regarding tasks such as diagnosis 
or treatment planning. For instance, given a tumor’s volume and location in the 
brain, a clinician may decide to utilize radiation in lieu of surgery. This general 
workflow is shown in Fig. 1. 
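The acquisition and reconstruction step can be made concrete with a toy example. Assuming fully sampled, noise-free data, MRI reconstruction amounts to an inverse discrete Fourier transform of the acquired frequency-domain data (often called k-space). The sketch below uses a naive pure-Python transform on a 4 x 4 image purely for illustration; practical pipelines use fast Fourier transform libraries and must handle noise, undersampling, and multiple receiver coils.

```python
import cmath


def dft(signal, inverse=False):
    """Naive 1D discrete Fourier transform (O(n^2)), or its inverse."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(
            x * cmath.exp(sign * 2j * cmath.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        out.append(total / n if inverse else total)
    return out


def dft2(image, inverse=False):
    """2D transform: apply the 1D transform to rows, then to columns."""
    rows = [dft(row, inverse) for row in image]
    cols = [dft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


# "Acquisition": the scanner measures the object in the Fourier domain.
image = [[0, 0, 0, 0],
         [0, 1, 2, 0],
         [0, 3, 4, 0],
         [0, 0, 0, 0]]
k_space = dft2(image)

# "Reconstruction": invert the encoding to return to the spatial domain.
reconstructed = [[abs(v) for v in row] for row in dft2(k_space, inverse=True)]
```

The raw `k_space` values are complex numbers with no obvious visual relation to the anatomy, which is why a reconstruction step is needed before a human can interpret the scan.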

Even though imaging has been used in clinical practice for many decades, 
problems still persist that hamper its efficacy. For example, patient motion dur- 
ing image acquisition may render a scan unreadable since most reconstruction 
algorithms are incapable of correcting for motion blur. Even when a scan is perfectly
acquired, the complete manual analysis may be too time-consuming to be
feasible, resulting in metrics such as the response assessment in neuro-oncology
(RANO) criteria [49] being used as a proxy measure for full volumetric tumor
burden. In the following sections, we will discuss some of the problems that
arise in the standard imaging pipeline and the opportunities that exist to uti- 
lize advanced deep learning techniques to improve the efficiency of each of these 
steps. 


[Figure 1 panels: Image Acquisition and Reconstruction; Image Analysis; Image Interpretation]

Fig. 1. The imaging pipeline is made up of three main components: 1) acquisition/reconstruction,
2) analysis, and 3) interpretation. Image acquisition and reconstruction
entails converting sensor domain data into the spatial domain. Image
enhancement/super-resolution can either be done in parallel with reconstruction, or
as a separate step. Image analysis for brain lesions includes anatomical and tumor
segmentations, along with automatic RANO measures. Finally, image interpretation
includes survival prediction, tumor histopathologic grading, and radiogenomic correlations,
among other applications.


2 Opportunities in Image Acquisition and Reconstruction


The first step in the imaging pipeline is acquisition and reconstruction. When 
an image is acquired, it is encoded into an intermediate representation of the 
image target known as the sensor domain. For this intermediate representa- 
tion to lead to an image, the function or encoding method used to encode the 
image into the sensor domain must be inverted in a process known as recon- 
struction. Image reconstruction is required for many kinds of medical imaging, 
including MRI, CT, and positron emission tomography. Existing approaches for 
reconstruction are incomplete since noisy, real-world data often precludes knowl- 
edge of an exact inverse transform. To overcome the problems with conventional 
image reconstruction methods, researchers have in recent years begun testing 
deep learning-based approaches. 

One example of a unified framework for deep learning-based image recon- 
struction is Automated Transform by Manifold Approximation (AUTOMAP) 
[54]. AUTOMAP is implemented with a deep neural network architecture com- 
posed of fully connected layers followed by convolutional layers. Zhu et al. gen- 
erated training data by taking a large set of images from a natural scene and 
inverse encoding them into the sensor domain with the desired encoding function 
to create a paired dataset. The network was then trained in a supervised learning 
manner, enabling the network to learn the optimal strategies for image recon- 
struction. The trained neural network was then applied to MRI images of the 
human brain. Surprisingly, they found that training on images of objects such 
as animals and plants (rather than MRI of the brain) still allowed for accurate 
reconstruction of brain MRI images for three of the four commonly used encoding 
schemes they tested, which implies the robustness of their approach. Moreover, 
AUTOMAP implicitly learned how to denoise imaging, removing common arti- 
facts such as zipper artifacts that would have persisted if the image had been 
reconstructed by conventional methods. When tested against simulated data
with known ground truth, AUTOMAP-reconstructed images were more
accurate and had a higher signal-to-noise ratio (SNR). The study opened opportunities for adopting
deep learning approaches for image reconstruction of a wide range of different 
imaging modalities without having to learn complex, modality-specific physics. 

Another groundbreaking reconstruction model for accelerated MRI is the 
Variational Network (VN) [21]. One of the biggest concerns about using learning-based
reconstruction methods in the clinical workflow was that they may not
preserve pathology-related features that are rare or specific to certain patients.
For efficient and accurate reconstruction of MRI data, Hammernik et al. proposed a trainable
formulation for accelerated parallel imaging-based MRI reconstruction inspired
by variational methods and deep learning. VN incorporates key concepts from
compressed sensing, formulated as a variational model within a deep learning
approach. This approach is designed to learn a complete reconstruction procedure
for complex multi-channel MRI data, including all free parameters that would
otherwise need to be established empirically. The authors trained the model on a complete
clinical protocol for musculoskeletal imaging, evaluated its performance on various
acceleration factors, and trained on both normal and pseudo-random Cartesian
2D sampling. Using clinical patient data, they investigated the ability of the VN
approach to preserve unique pathologies not included in the training dataset. 
Surprisingly, it was able to preserve important features not present in the train- 
ing data, outperforming conventional reconstructions for a range of pathologies 
while providing unprecedented reconstruction speeds. 
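To illustrate what accelerated Cartesian sampling means in practice, the sketch below builds a one-dimensional mask over phase-encoding lines: a small fully sampled block at the center of k-space is always kept, and the remaining lines are either taken at a regular stride or drawn pseudo-randomly. The function name, the `num_center` default, and the way the pseudo-random budget is computed are assumptions for illustration and are not taken from [21].

```python
import random


def cartesian_mask(num_lines, acceleration, num_center=4,
                   pseudo_random=False, seed=0):
    """Return a 0/1 mask over phase-encoding lines (1 = line is acquired)."""
    # Always keep a fully sampled block around the center of k-space.
    center_start = (num_lines - num_center) // 2
    keep = set(range(center_start, center_start + num_center))
    if pseudo_random:
        # Draw the remaining line budget at random from the outer lines.
        rng = random.Random(seed)
        outer = [i for i in range(num_lines) if i not in keep]
        budget = max(0, num_lines // acceleration - num_center)
        keep.update(rng.sample(outer, min(budget, len(outer))))
    else:
        # Regular undersampling: every `acceleration`-th line.
        keep.update(range(0, num_lines, acceleration))
    return [1 if i in keep else 0 for i in range(num_lines)]


regular_mask = cartesian_mask(32, acceleration=4)
random_mask = cartesian_mask(32, acceleration=4, pseudo_random=True)
```

Skipping lines shortens the scan roughly in proportion to the acceleration factor, and the reconstruction method must then recover the image from the incomplete measurements.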


3 Opportunities in Image Analysis 


The second step in the imaging pipeline is analysis. Here, information neces- 
sary for downstream tasks is either manually or automatically extracted. Med- 
ical image analysis covers a wide span of topics, including but not limited to 
anatomical segmentation and volumetric quantification, extraction of parameter 
maps from diffusion/perfusion imaging, and groupwise population analyses. In 
this section, we will specifically look at examples involving brain tumor segmen- 
tation. 

Primary and metastatic brain tumors account for nearly 200,000 new cases 
in the US every year, and imaging plays a crucial role in optimizing patient care 
[43,48]. Segmentation of tumor boundaries is a necessary component for suc- 
cessful surgical and radiotherapy treatment planning [14]. Unfortunately, tumor 
segmentation is a challenging task requiring substantial domain expertise. Fur- 
thermore, as many studies have shown, motion artifacts, field inhomogeneities, 
and differences in imaging protocols both within and across medical institutions 
lead to non-negligible amounts of human error as well as significant amounts of 
intra- and inter-rater variability [31]. 

To combat these issues, researchers have turned to deep learning as it has 
the potential to produce accurate and reproducible results many orders of mag- 
nitude faster than can be accomplished manually. The shift to trainable AI is 
being further encouraged by the release of open-source datasets with high-quality 
annotations such as that from the Multimodal Brain Tumor Segmentation Chal- 
lenge (BraTS) [4,6-8,33]. 

Variations of 3D U-Nets [46] have provided state-of-the-art results for seg- 
mentation of primary brain tumors. For example, Myronenko won the 2018 
BraTS challenge utilizing an asymmetrical residual U-Net, where most of the
trainable parameters of the model resided in the encoder. Furthermore, in con- 
trast to the standard U-Net framework which uses four or five downsampling 
operations in the encoder, he applied only three in order to preserve spatial con- 
text [36]. Other modifications to the U-Net structure have also been used with 
success. Jiang et al. won the 2019 challenge using a two-stage cascaded asym- 
metrical residual U-Net, where the second stage of their cascade was used to 
refine the coarse segmentation maps generated by the first stage [27]. The sec- 
ond place that year was awarded to Zhao et al., who utilized dense blocks along 
with various optimization strategies such as variable patch/batch size training, 
heuristic sampling, and semi-supervised learning [52]. It is important to note that 
while architectural modifications to the U-Net can provide performance boosts, 
they are not always necessary. Indeed, Isensee et al. won the 2020 challenge
with their architecture coined “No New-Net”, highlighting that a vanilla
U-Net coupled with excellent training and optimization strategies can still
achieve state-of-the-art results. Moreover, they reported an average testing set
Dice score of 88.95% for whole tumor segmentation, a level of performance
indistinguishable from that of human experts [25].

Similar strategies have been shown to work for metastatic brain tumors, 
which present additional hurdles compared to primary brain tumors. Patients 
with metastases often present with more than one target lesion along with micro- 
metastases spread systemically across the brain parenchyma. Micro-metastases 
are particularly challenging to segment due to their size and limited contrast 
enhancement. Various approaches have been proposed, from two-stage detec- 
tion/segmentation pipelines to modifications of the loss function. While these 
approaches have yielded some improvement in performance, much work is still 
needed. For example, Zhou et al. developed a two-stage pipeline consisting of a 
detection stage followed by a segmentation stage. While they reported an excel- 
lent dice score of 87% on large metastases (>6 mm), their results dropped to 
just 17% for micro-metastases (<3 mm) [53]. This trend is seen in other studies 
as well [11,47], indicating the strong need for better segmentation algorithms for 
brain metastases cases. 

Longitudinal measurement of lesion burden is the basis for treatment 
response assessment. While volumetric measurement would be the ideal metric 
for lesion burden, the aforementioned issues with manual tumor segmentation 
necessitate the use of proxy measures such as RANO. RANO for gliomas is 
defined as the product of the maximum bidimensional diameters of the largest 
axial cross-section of the tumor on MRI [49]. Even this metric is subject to inter- 
rater variability, since different raters may choose differing slices based on their 
subjective assessment of which axial slice has the largest tumor area. To auto- 
mate this process, Chang et al. developed a tool called AutoRANO which used 
the outputs of a segmentation model capable of running on post-operative imaging
to derive RANO measurements. They noted that AutoRANO had a higher
correlation with manual contrast-enhancing volume than did manual RANO 
measures performed by expert radiologists, suggesting that AutoRANO may be 
a more accurate measure of tumor burden than manual RANO [14]. Similar work 
has been done to automate bi-directional measurements for other tumor types, 
with equally promising results [40]. 
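
As a rough illustration of how a bidimensional product can be derived from a segmentation mask, the following minimal sketch selects the axial slice with the largest tumor area and multiplies the longest in-plane diameter by the lesion extent perpendicular to it. It is not the AutoRANO implementation: the function name `bidimensional_product`, the `spacing` argument, and the use of pixel centres are assumptions made here, and the perpendicular measure is a simplification of the clinical definition.

```python
import math
from itertools import combinations


def bidimensional_product(mask, spacing=(1.0, 1.0)):
    """Approximate bidimensional product for a 3D binary mask.

    mask: list of axial slices, each a 2D list of 0/1 values.
    spacing: (row, column) pixel spacing in mm, an assumed input.
    """
    # Pick the axial slice with the largest tumor area (most foreground pixels).
    best = max(mask, key=lambda s: sum(map(sum, s)))
    pts = [(r * spacing[0], c * spacing[1])
           for r, row in enumerate(best) for c, v in enumerate(row) if v]
    if len(pts) < 2:
        return 0.0
    # Longest diameter: farthest pair of foreground pixel centres (brute force).
    a, b = max(combinations(pts, 2), key=lambda p: math.dist(p[0], p[1]))
    major = math.dist(a, b)
    # Unit vector perpendicular to the longest diameter.
    ux, uy = (b[0] - a[0]) / major, (b[1] - a[1]) / major
    px, py = -uy, ux
    # Simplification: the perpendicular diameter is taken as the extent of the
    # lesion along that direction, not as a chord lying inside the lesion.
    proj = [x * px + y * py for x, y in pts]
    minor = max(proj) - min(proj)
    return major * minor
```

Because the slice is chosen by pixel count rather than by eye, repeated runs on the same mask always return the same value, which is the property that motivates automating the measurement.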


4 Opportunities in Image Interpretation 


The final step in the imaging pipeline is interpretation. From a machine learning 
standpoint, this is often framed as a classification problem. For example, with
regard to brain tumors, image classification tasks include but are not limited
to identifying subtypes, predicting pseudo-progression versus true progression, 
ascertaining tumor malignancy status, and identifying treatment responders. 
Two areas in which the rise of AI has been particularly exciting are
radiogenomics and survival prediction.

Radiogenomics refers to the correlation between imaging features and specific
gene expression patterns/molecular profiles of tumors. Such approaches have 
mainly been studied for primary gliomas, but interest is accruing to replicate 
such studies for brain metastases and spinal cord tumors as well. The ability to 
predict molecular marker status noninvasively is important since a priori knowl- 
edge of the mutational status of key genes together with radiographic suspicion 
of a neoplasm might favor early intervention and/or mutation-specific thera- 
peutic interventions. In the case of gliomas, the MGMT gene, which codes for 
an enzyme responsible for DNA repair following alkylating agent chemother- 
apy, may be silenced by methylation of its promoter during tumor development, 
thereby preventing repair of DNA damage. This increases the potential effec- 
tiveness of alkylating agent chemotherapy for these patients [23]. In order to 
demonstrate that a deep learning model could predict MGMT methylation sta- 
tus from imaging without the need for explicitly providing a tumor segmentation, 
Korfiatis et al. [30] trained three deep residual neural networks of varying sizes on 
a training dataset of 110 patients with T2-weighted MRI, artificially increasing 
the size of this dataset by splitting all 3D imaging into 2D axial slices. Here, the 
authors found that deeper, more parametrized networks produce better results, 
with their ResNet50 model achieving an accuracy of 94.9% on the test set (45 
patients with 2612 slices). Another key gene conferring longer survival in glioma 
patients is IDH, which in its wild-type form codes for an enzyme responsible for 
the conversion of isocitrate to α-ketoglutarate in the Krebs cycle. Patients with
gliomas harboring an IDH1/2 mutation have significantly longer overall survival
than those with the corresponding wild type [12]. Chang et al. [12] used a similar methodology
as Korfiatis et al. [30] for the prediction of IDH status, utilizing a residual neural 
network with 2D inputs. In this case, the network required a predefined tumor 
segmentation, since it was trained on cropped tumor images only. The authors 
performed exceptional multi-institutional evaluation, acquiring data from three 
different sites, and reporting a final accuracy and AUC on a testing set of 147 
patients of 87.6% and 0.95, respectively. Similarly, Akkus et al. [2] focused on 
the prediction of 1p19q co-deletion, a highly prognostic molecular marker asso- 
ciated with longer survival in low-grade glioma (LGG) patients. With only 387 
slices in the training data, the authors noted extreme overfitting, initially see- 
ing perfect training sensitivity, specificity, and accuracy. To mitigate this, they 
made use of data augmentation techniques such as random translations, rota- 
tions, and flips, resulting in an increased final test set accuracy from 63.3% to 
87.7%. Additionally, Chang et al. [16] aimed to integrate prediction of MGMT 
methylation status, IDH mutation status, and 1p19q codeletion into a single 
residual network. After five-fold cross-validation on their dataset of 259 patients 
(5259 slices), they achieved mean accuracy of 83%, 94%, and 92%, respectively, 
on the three tasks. Finally, MGMT methylation status prediction from MRI was 
a key component of the BraTS 2021 challenge, in which many teams utilized
machine learning techniques for non-invasive assessment. 

Survival analysis is a technique employed in cohort and other longitudinal 
studies to predict the time it takes for a particular event to occur. In these
studies, individuals are followed from an initial observation (e.g., study enrollment,
time of diagnosis/treatment) until the occurrence of a subsequent event (e.g., 
death, disease, relapse) or until follow-up is no longer possible. Depending on 
what event is used, the time between the two is denoted as progression-free sur- 
vival or overall survival (OS) [39]. Survival analyses of brain tumors have utilized 
both radiomics-based approaches and deep learning, as well as an integration
of the two. Ujjwal et al. [5] proposed a three-step framework for OS predic- 
tion which involved segmentation, radiomic feature extraction, and a survival 
prediction model to stratify patients into three survival groups (short-, mid-, 
and long-term survivors) and to predict OS. This approach achieved accuracy 
scores of 0.571 and 0.558 on validation and testing cohorts of 53 and 130 cases 
respectively. Finally, Han et al. [22] incorporated both hand-crafted radiomics 
features and deep features generated by a pretrained CNN on a dataset of 178 
high-grade glioma patients (50 local, and 128 from TCGA), applying feature 
selection and Elastic Net-Cox modeling to classify patients into short- and long- 
term survivors. This combined feature analysis framework resulted in a log-rank 
test p-value of <0.001 for the 50 patient local cohort, and a corresponding value 
of 0.014 for the 128 patient TCGA cohort. 


5 Challenges 


As mentioned in the previous sections, there are significant opportunities to 
improve clinical decision-making and patient management using AI. However, 
it is important to keep in mind certain caveats and challenges to developing 
effective deep learning models for healthcare applications. First, it is important 
to acknowledge the brittleness of deep learning models, or in other words, the lack 
of generalizability across different acquisition settings and patient populations 
[15,18]. For example, different hospitals may have MRI scanners with different 
field strengths or use different scanning protocols. Different hospitals may also 
admit patients of different age groups or racial backgrounds. These institutional 
differences are further exacerbated by the fact that many medical datasets are 
small, either due to rare pathology, costly human annotations, or simply due to 
difficulty in extracting data from antiquated electronic medical record systems. 
Indeed, empirical studies have shown that there is a drop in the performance of 
deep learning models for brain lesions when evaluated at institutions different 
from the ones in which they were trained [3,44]. One approach to handle the issue 
of generalizability is to accumulate large quantities of diverse, multi-institutional 
patient data. However, logistical issues, as well as patient privacy concerns may 
render this impractical. Another approach involves fine-tuning the existing model 
on a small quantity of new data when there is dataset shift [44]. More generally, 
continuous learning methods allow models to be “living” and to be refined as the 
data changes [42]. Other approaches include methods to adapt either the data 
or the model itself to be able to handle new domains with approaches under the 
umbrella of domain adaptation [28,51]. If large quantities of data are available, 
but not shareable between institutions, distributed learning approaches can be
used to train models without the need to share patient data, overcoming patient
privacy barriers [13,45].
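
To make the idea concrete, the sketch below shows federated averaging, one common form of distributed learning in which each site trains locally and only model parameters are exchanged. This is an illustrative assumption and may differ from the schemes used in the cited works; the function name `federated_average` and the dictionary layout of the weights are hypothetical.

```python
def federated_average(site_weights, site_sizes):
    """Weighted average of per-site model parameters.

    site_weights: list of dicts mapping parameter name -> list of floats.
    site_sizes: number of local training cases at each site.
    """
    total = sum(site_sizes)
    averaged = {}
    for name in site_weights[0]:
        length = len(site_weights[0][name])
        # Each site contributes in proportion to its number of cases.
        averaged[name] = [
            sum(w[name][k] * n for w, n in zip(site_weights, site_sizes)) / total
            for k in range(length)
        ]
    return averaged
```

Only the parameter lists leave each institution in this scheme; the images and labels stay local.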

Another major challenge facing trainable AI models is the lack of definitive 
ground truth. For example, for the segmentation of brain lesions, there is often 
subjectivity involved in determining tumor boundaries, especially for lesions 
that are diffusely edematous. Similarly, the boundaries of contrast enhancement 
may be ambiguous as well due to the presence of necrotic regions. This sub- 
jectivity is primarily due to the spatial resolution limitations of MRI, which 
makes categorizing tumor components into discrete bins of necrosis, enhanc- 
ing, or edema difficult. Thus, it is unsurprising that there is significant intra- 
and inter-rater variability for neuroimaging related segmentation [10,17,34]. In 
the case of radiogenomic prediction using ground truth from a single biopsy 
site, there is also uncertainty stemming from regional intra-lesional genetic het- 
erogeneity of tumors [37,38,41,50]. This is further compounded by multi-focal 
lesions, which can also display genetic heterogeneity across lesions from the same 
patient [1]. For other prediction tasks, such as prognostic assessment, there may 
be significant confounders that are not incorporated into the inputs, such as 
degree of resection and chemotherapeutic regimen. Taken together, the clinical 
utility and efficacy of machine learning models may be limited if there is no 
way to handle uncertainty within the data. One way to potentially mitigate this 
problem is to utilize deep learning methods that can estimate uncertainty to 
provide multiple possible outputs, mimicking variability by different clinicians 
[29]. Another viable approach is to train networks to directly report a measure 
of uncertainty, thus allowing clinicians to stratify network outputs by the degree 
of confidence [24,32]. This would enable flagging of highly uncertain cases for 
further manual expert review. 
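
A minimal sketch of such flagging is given below. It assumes that a model outputs one foreground probability per voxel and uses mean binary entropy as the uncertainty measure; the function names and the threshold of 0.5 are placeholders chosen here, not values from the cited works.

```python
import math


def binary_entropy(p, eps=1e-12):
    """Shannon entropy (in bits) of a Bernoulli probability p."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def flag_uncertain_cases(case_probs, threshold=0.5):
    """Return IDs of cases whose mean voxel entropy exceeds the threshold.

    case_probs: dict mapping case ID -> list of predicted foreground
    probabilities (one per voxel). The threshold is an arbitrary placeholder.
    """
    flagged = []
    for case_id, probs in case_probs.items():
        mean_entropy = sum(binary_entropy(p) for p in probs) / len(probs)
        if mean_entropy > threshold:
            flagged.append(case_id)
    return flagged
```

In practice the threshold would need to be calibrated on held-out data so that the flagged fraction matches the available capacity for expert review.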

A final challenge that should be mentioned is the reproducibility of deep 
learning studies for neuroimaging. With the rapid pace of advances within the 
field, new research often builds upon previous work to yield improvements in 
performance. However, without the release of code, much effort would need to 
be devoted to reproducing previously published results for further evaluation and 
development [20]. As such, there has been a growing trend towards the release of 
open-source frameworks for medical AI to allow for greater collaboration within 
the research and clinical communities [9,19,26]. On a similar front, the public 
release of code is increasingly becoming the expectation for publication [35]. 
However, this is not without potential concerns of its own, since it may result in 
the accidental leaking of protected patient health information or may deter the 
commercialization of research. 


6 Conclusion 


Significant progress has been made in the last few years to automate and increase 
the efficiency of all steps in the imaging pipeline via the use of deep learning. 
Specifically, greater accessibility to large-scale multi-institutional datasets and 
better computational resources together have led to advances in image
reconstruction, analysis, and interpretation. Our review has highlighted some of the
exciting AI research being performed at each of these steps in the imaging
pipeline, as well as some challenges and pitfalls that all researchers working with
neuroimaging data must acknowledge.
Abstract. In this paper, we propose a novel network named Efficient
Multi Scale Vision Transformer for Biomedical Image Segmentation
(EMSViT). Our network splits the input feature maps into three parts
with 1 x 1, 3 x 3 and 5 x 5 convolutions in both the encoder and the decoder.
A concatenation operator merges the features before they are fed to three
consecutive transformer blocks with an attention mechanism embedded inside
them. Skip connections connect the encoder and decoder transformer
blocks. Similarly, transformer blocks and a multi-scale architecture are used
in the decoder before the features are linearly projected to produce the output
segmentation map. We test the performance of our network using the Synapse
multi-organ segmentation dataset, the Automated Cardiac Diagnosis Challenge
dataset, the brain tumour MRI segmentation dataset and the spleen CT
segmentation dataset. Without bells and whistles, our network outperforms
most of the previous state-of-the-art CNN and transformer-based
models using the Dice score and the Hausdorff distance as the evaluation
metrics.


1 Introduction 


Deep convolutional neural networks (CNNs) have been highly successful in medical
image segmentation. U-Net (Ronneberger et al. 2015) based architectures use a
symmetric encoder-decoder network with skip connections. The limitation of
CNN-based approaches is that they are unable to model long-range relations, due
to the regional locality of convolution operations. To tackle this problem,
self-attention mechanisms were proposed (Schlemper et al. 2019; Wang et al. 2018).
Still, the problem of capturing multi-scale contextual information was not solved,
which leads to inaccurate segmentation of structures with variable shapes and
scales (e.g. brain lesions with different sizes). Transformers, an alternative
technique, are better suited to modeling global contextual information.
Vision Transformer (ViT) (Dosovitskiy et al. 2020) splits the image into
patches and models the correlation between these patches as sequences with
a Transformer, achieving a better speed-performance trade-off on image
classification than previous state-of-the-art image recognition methods. DeiT
(Touvron et al. 2020) proposed a knowledge distillation method for training Vision
Transformers. An extensive study was done by Bakas et al. (2018) to find the best
algorithm for segmenting tumours in the brain. Medical images from CT and MRI
are three-dimensional, which makes volumetric segmentation important. Cicek
et al. (2016) tackled this problem using 3D U-Net. Densely connected volumetric
convnets were used by Yu et al. (2017) to segment cardiovascular images. A
comprehensive study evaluating segmentation performance using the Dice score and
Jaccard index was done by Eelbode et al. (2020).


2 Related Work 


2.1 Convolutional Neural Network 


Earlier work on medical image segmentation used variants of the original
U-shaped architecture (Ronneberger et al. 2015), such as Res-UNet
(Xiao et al. 2018), Dense-UNet (Li et al. 2018) and U-Net++ (Zhou et al.
2018). These architectures are quite successful for various kinds of problems in
the domain of medical image segmentation.


2.2 Attention Mechanism 


The self-attention mechanism (Wang et al. 2018) has been used successfully to
improve network performance. Schlemper et al. (2019) used skip connections
with an additive attention gate in a U-shaped architecture to perform medical
image segmentation. The attention mechanism was first used in U-Net by Oktay
et al. (2018) for medical image segmentation. A multi-scale attention network
(Fan et al. 2020) was proposed in the context of biomedical image segmentation.
Jin et al. (2020) used a hybrid deep attention-aware network to extract the
liver and tumors in CT scans. An attention module was added to U-Net to
exploit full-resolution features for medical image segmentation (Li et al. 2020).
Similar work using an attention-based CNN was done by Liu et al. (2020) in the
context of ischemic stroke. A multi-scale self-guided attention network
was used to achieve state-of-the-art results (Sinha and Dolz 2020) for medical
image segmentation.


2.3 Transformers 


Transformers, first proposed by Vaswani et al. (2017), have achieved
state-of-the-art performance on various tasks. Inspired by them, Vision Transformer
(Dosovitskiy et al. 2020) was proposed, which achieved a better speed-accuracy
trade-off for image recognition. To improve on this, Swin Transformer (Liu et al.
2021) was proposed, which outperformed previous networks on various vision tasks
including image classification, object detection and semantic segmentation. Chen
et al. (2021), Valanarasu et al. (2021) and Hatamizadeh et al. (2021) independently
proposed methods to integrate CNNs and transformers into a single network for
medical image segmentation. Transformers combined with CNNs have been applied to
multi-modal brain tumor segmentation (Wang et al. 2021) and 3D medical image
segmentation (Xie et al. 2021).

Our main contributions can be summarized as: 


• We propose a novel network, named EMSViT, that incorporates an attention
mechanism in a transformer architecture along with a multi-scale module, in the
context of medical image segmentation.

• Our network outperforms previous state-of-the-art CNN-based as well as
transformer-based architectures on various datasets.

• We present an ablation study showing that our network's performance is
generalizable, so it can be incorporated to tackle other similar problems.


2.4 Background 


Suppose an image $x \in \mathbb{R}^{H \times W \times C}$ is given, with a spatial resolution of $H \times W$
and $C$ channels. The goal is to predict the pixel-wise label map of size
$H \times W$ for each image. We start by performing tokenization, reshaping the
input $x$ into a sequence of flattened 2D patches $x_p^i \in \mathbb{R}^{P^2 \cdot C}$ $(i = 1, \dots, N)$,
where each patch is of size $P \times P$ and $N = (H \times W)/P^2$ is the number of patches
present in the image. We convert the vectorized patches $x_p$ into a latent
$D$-dimensional embedding space using a linear projection. We add position
embeddings to the patch embeddings to make sure the positional information is
present, as shown below:


$$z_0 = [x_p^1 E;\ x_p^2 E;\ \dots;\ x_p^N E] + E_{pos} \qquad (1)$$


where $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ denotes the patch embedding projection, and
$E_{pos} \in \mathbb{R}^{N \times D}$ denotes the position embedding.
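
The tokenization step can be sketched in a few lines of plain Python. This is only an illustration of the reshaping, under the assumption that the image is stored as nested lists indexed as `image[row][column][channel]`; the function name `patchify` is hypothetical, and the linear projection and position embedding are left out.

```python
def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened P x P patches.

    Returns N = (H * W) / P**2 vectors, each of length P**2 * C.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("H and W must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            vec = []
            for r in range(top, top + patch_size):
                for c in range(left, left + patch_size):
                    vec.extend(image[r][c])  # append all C channel values
            patches.append(vec)
    return patches
```

For a 224 x 224 input with a patch size of 4, the formula for N gives 224 · 224 / 16 = 3136 patches.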

After the embedding layer, we use multi scale context block followed by a 
stack of transformer blocks (Dosovitskiy et al. 2020) made up of multiheaded 
self-attention (MSA) and multilayer perceptron (MLP) layers as shown in Eq. 2 
and Eq. 3 respectively: 


$$z'_i = \mathrm{MSA}(\mathrm{Norm}(z_{i-1})) + z_{i-1} \qquad (2)$$


$$z_i = \mathrm{MLP}(\mathrm{Norm}(z'_i)) + z'_i \qquad (3)$$

where Norm represents layer normalization, MLP is made up of two linear layers
and $i$ indexes the individual block. An MSA block is made up of $n$ self-attention
(SA) heads in parallel. The structure of the Transformer layer used in this work is
illustrated in Fig. 1:


3 Method 


3.1 Dataset 


1. Synapse multi-organ segmentation dataset - We use 30 abdominal CT
scans from the MICCAI 2015 Multi-Atlas Abdomen Labeling Challenge, with 3779
axial contrast-enhanced abdominal clinical CT images in total.
2. Brain Tumor Segmentation dataset - The 3D MRI dataset used in the
experiments is provided by the BraTS 2019 challenge (Menze et al. 2014;
Bakas et al. 2018).


[Figure: transformer layer diagram; legible block labels: Embedded Sequence, Layer Norm, Layer Norm]


Fig. 1. Schematic of the transformer layer used in this work.


3.2 Network Architecture 


The output sequence of the Transformer, $z_L \in \mathbb{R}^{d \times N}$, is first reshaped to
$d \times H/8 \times W/8 \times D/8$. A convolution block is used to reduce the channel
dimension from $d$ to $K$, which helps in reducing the computational complexity.
Upsampling operations and successive convolution blocks are then used to recover a
full-resolution segmentation result $R \in \mathbb{R}^{H \times W \times D}$. Skip connections are used
to fuse the encoder features with the decoder by concatenation to capture more
contextual information. In the encoder, the input image is split into patches and
fed into a linear embedding layer. The feature map is split into N parts along
the channel dimension. The individual features are fused before being passed to
the transformer blocks. The decoder is composed of transformer blocks
followed by a similar split and concat operator. Linear projection is applied to
the feature maps to produce the segmentation map. Skip connections are used
between the encoder and decoder transformer blocks to provide an alternative
path for the gradient to flow, thus speeding up the training process.

Two different types of convolutional operations are applied to the encoder
features $F_{en}$ to generate the feature maps $F_1$ and $F_2$, respectively (with
$F_2 \in \mathbb{R}^{c \times h \times w}$). Subsequently, these feature maps are reshaped into the
matrices $M$, $N$ and $W$. Then, a matrix multiplication with softmax normalization is
performed on the permuted version of $M$ and $N$, resulting in the position attention
map $B \in \mathbb{R}^{(h \times w) \times (h \times w)}$, which can be defined as:

$$B_{i,j} = \frac{\exp(M_i \cdot N_j)}{\sum_{i=1}^{n} \exp(M_i \cdot N_j)} \qquad (4)$$


where $B_{i,j}$ measures the impact of the $i$-th position on the $j$-th position and
$n = h \times w$ is the number of pixels. After that, $W$ is multiplied by the permuted
version of $B$, and the resulting feature at each position can be formulated as:


$$\mathrm{GSA}(M, N, W)_j = \sum_{i=1}^{n} (B_{i,j} W_i) \qquad (5)$$
Similarly, we reshape the resulting features to generate the final output of 
our vision transformer. 


3.3 Residual Connection 


The input feature maps of each decoder block are up-sampled to the resolution of 
outputs through bilinear interpolation, and then concatenated with the output 
feature maps as the inputs of the subsequent block, which is defined as: 


$$F_n = f_n(F_{n-1}) \oplus U_n(F_{n-1}) \qquad (6)$$


The detailed architecture of our network, as well as the intermediate skip
connections, is shown in Fig. 2:

Fig. 2. Overview of our model architecture. Output sizes demonstrated for patch
dimension N = 16 and embedding size C = 768. We extract sequence representations
of different layers in the transformer and merge them with the decoder using skip
connections.

Similar to previous works (Hu et al. 2019), self-attention is computed as
defined below:


$$\mathrm{MSA}(Q, K, V) = \mathrm{SoftMax}\left(\frac{QK^T}{\sqrt{d}} + B\right)V \qquad (7)$$


where $Q, K, V \in \mathbb{R}^{M^2 \times d}$ denote the query, key and value matrices. $M^2$ and
$d$ denote the number of patches in a window and the dimension of the
query, respectively. The values in $B$ are taken from the random bias matrix
$\hat{B} \in \mathbb{R}^{(2M-1) \times (2M+1)}$.


The output of MSA is defined below:


$$\mathrm{TMSA}(z) = [\mathrm{MSA}_1(z);\ \mathrm{MSA}_2(z);\ \dots;\ \mathrm{MSA}_n(z)]\, W_{msa} \qquad (8)$$


where $W_{msa}$ represents the learnable weight matrices of different heads (SA).
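
For readers who prefer code to notation, the following is a minimal single-head sketch of scaled dot-product attention with an additive bias term, written in plain Python rather than a deep learning framework. It is an illustration only: the function names, the list-of-lists representation, and the way the bias is passed in are assumptions made here and do not reproduce the network's actual implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, bias=None):
    """Single-head attention: SoftMax(Q K^T / sqrt(d) + B) V.

    q, k, v: lists of n token vectors; q and k have length d per token.
    bias: optional n x n list of additive terms (B); zeros if omitted.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Scaled dot product of query i with every key, plus the bias.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            + (bias[i][j] if bias else 0.0)
            for j in range(n)
        ]
        weights = softmax(scores)
        # Output token i is the attention-weighted sum of the value vectors.
        out.append([
            sum(weights[j] * v[j][c] for j in range(n))
            for c in range(len(v[0]))
        ])
    return out
```

A multi-head layer would run several such heads in parallel on separately projected inputs, concatenate their outputs, and apply one further learned projection.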


3.4 Loss Function 


Commonly used Binary Cross Entropy and Dice Loss terms are used for training
our network as defined in Eq. 9 and Eq. 10 respectively:


$$L_{BCE} = -\sum_{i=1}^{t} \left( y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right) \qquad (9)$$

$$L_{Dice} = 1 - \frac{\sum_{i=1}^{t} y_i p_i + \epsilon}{\sum_{i=1}^{t} (y_i + p_i) + \epsilon} \qquad (10)$$

where $t$ is the total number of pixels in each image, $y_i$ represents the
ground-truth value of the $i$-th pixel, and $p_i$ the confidence score of the $i$-th
pixel in the prediction results. The above two loss functions can be combined to give:


$$L_{total} = L_{BCE} + L_{Dice} \qquad (11)$$


The complete loss function is a combination of Dice and cross-entropy terms,
calculated in a voxel-wise manner as defined below:


$$L_{total} = 1 - \alpha \frac{2}{J} \sum_{j=1}^{J} \frac{\sum_{i=1}^{I} G_{i,j} Y_{i,j}}{\sum_{i=1}^{I} G_{i,j}^2 + \sum_{i=1}^{I} Y_{i,j}^2} - \beta \frac{1}{I} \sum_{i=1}^{I} \sum_{j=1}^{J} G_{i,j} \log Y_{i,j} \qquad (12)$$


where $I$ is the number of voxels, $J$ is the number of classes, and $Y_{i,j}$ and $G_{i,j}$ denote
the probability output and one-hot encoded ground truth for voxel $i$ of class $j$.
In our experiment, $\alpha = \beta = 0.5$, and $\epsilon = 0.0001$.
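
A small plain-Python sketch of a combined cross-entropy and Dice loss for the binary case is given below. It is an illustration under stated assumptions rather than the training code: the function names are hypothetical, the cross-entropy is summed over pixels, and the Dice term uses the conventional soft Dice form with a factor of 2 in the numerator so that a perfect prediction gives a loss of zero.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy summed over pixels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total


def dice_loss(y_true, y_pred, eps=1e-4):
    """Soft Dice loss (conventional form, assumed here)."""
    inter = sum(y * p for y, p in zip(y_true, y_pred))
    denom = sum(y_true) + sum(y_pred)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def total_loss(y_true, y_pred):
    """Unweighted sum of the two terms."""
    return bce_loss(y_true, y_pred) + dice_loss(y_true, y_pred)
```

The cross-entropy term penalizes each pixel independently, while the Dice term depends on the overlap of the whole mask, which makes the combination less sensitive to class imbalance than cross-entropy alone.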


3.5 Evaluation Metrics 


The segmentation accuracy is measured by the Dice score and the Hausdorff 
distance (95%) metrics for enhancing tumor region (ET), regions of the tumor 
core (TC), and the whole tumor region (WT). 
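
Both metrics can be sketched compactly when each segmentation is represented as a set of voxel coordinates. The code below is a hedged illustration, not the challenge's evaluation script: the function names are hypothetical, the Hausdorff computation uses all foreground voxels rather than surface voxels only, it pools both directed distance lists before taking a nearest-rank 95th percentile, and it requires both sets to be non-empty. Official implementations may differ in each of these choices.

```python
import math


def dice_score(pred, truth):
    """Dice overlap between two sets of voxel coordinates."""
    if not pred and not truth:
        return 1.0  # convention assumed here: two empty masks agree perfectly
    return 2.0 * len(pred & truth) / (len(pred) + len(truth))


def hd95(pred, truth):
    """95th-percentile symmetric Hausdorff distance between two point sets."""
    def directed(src, dst):
        # Distance from every point in src to its nearest point in dst.
        return [min(math.dist(p, q) for q in dst) for p in src]

    dists = sorted(directed(pred, truth) + directed(truth, pred))
    # Nearest-rank percentile; other implementations interpolate instead.
    rank = max(1, math.ceil(0.95 * len(dists)))
    return dists[rank - 1]
```

Distances here are in voxel units; multiplying the coordinates by the voxel spacing beforehand would give millimetres.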
3.6 Implementation Details


Our model is trained using the PyTorch deep learning framework. The learning rate
and weight decay are 0.00015 and 0.005, respectively. We use a batch
size of 16 and the Adam optimizer to train our model. We use a random crop
of 128 x 192 x 192 and mean normalization to prepare our model input. The
input image size and patch size are set to 224 x 224 and 4, respectively. As
model input, we use the 3D volume obtained by cropping the brain region. The
following data augmentation techniques are applied:


1. Random cropping of the data from 240 x 240 x 155 to 128 x 128 x 128 voxels;
2. Flipping across the axial, coronal and sagittal planes with a probability of 0.5;
3. Random intensity shift in [−0.05, 0.05] and scale in [0.5, 1.0].
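
The three augmentation steps listed above can be sketched in plain Python as follows. This is a simplified illustration, not the training pipeline: the function name `augment`, the nested-list volume layout, the mapping of the three planes onto the three array axes, and the order in which scale and shift are applied are all assumptions made here.

```python
import random


def augment(volume, crop=(128, 128, 128), rng=random):
    """Random crop, random flips and intensity jitter for one 3D volume.

    volume: nested lists indexed as volume[x][y][z].
    rng: the random module or a random.Random instance (for reproducibility).
    """
    dims = (len(volume), len(volume[0]), len(volume[0][0]))
    # 1. Random crop: choose a start index along each axis.
    x0, y0, z0 = (rng.randint(0, d - c) for d, c in zip(dims, crop))
    out = [[row[z0:z0 + crop[2]] for row in plane[y0:y0 + crop[1]]]
           for plane in volume[x0:x0 + crop[0]]]
    # 2. Flip along each axis independently with probability 0.5.
    if rng.random() < 0.5:
        out = out[::-1]
    if rng.random() < 0.5:
        out = [plane[::-1] for plane in out]
    if rng.random() < 0.5:
        out = [[row[::-1] for row in plane] for plane in out]
    # 3. Intensity shift in [-0.05, 0.05] and scale in [0.5, 1.0].
    shift = rng.uniform(-0.05, 0.05)
    scale = rng.uniform(0.5, 1.0)
    return [[[v * scale + shift for v in row] for row in plane] for plane in out]
```

The same crop offsets and flips would have to be applied to the label volume, without the intensity change, so that image and segmentation stay aligned.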


4 Results 


We report the average DSC and average Hausdorff distance (HD) on 8 abdominal
organs (aorta, gallbladder, left kidney, right kidney, liver, pancreas,
spleen, stomach), with a random split of 20 samples for the training set and 10
samples for the validation set, using the Synapse multi-organ CT dataset in Table 1.
Our network achieves the highest average DSC (80.45%) and the lowest average HD
(21.24 mm) among the compared CNN and transformer networks.


Table 1. Comparison on the Synapse multi-organ CT dataset (average dice score %, 
average Hausdorff distance in mm, and dice score % for each organ). The best results 
are highlighted in bold. 


Encoder    | Decoder   | DSC   | HD    | Aorta | GB    | Kid (L) | Kid (R) | Liver | Panc  | Spleen | Stomach
V-Net      | V-Net     | 68.81 | -     | 75.34 | 51.87 | 77.10   | 80.75   | 87.84 | 40.05 | 80.56  | 56.98
DARR       | DARR      | 69.77 | -     | 74.74 | 53.77 | 72.31   | 73.24   | 94.08 | 54.18 | 89.90  | 45.96
R50        | U-Net     | 74.68 | 36.87 | 84.18 | 62.84 | 79.19   | 71.29   | 93.35 | 48.23 | 84.41  | 73.92
R50        | AttnUNet  | 75.57 | 36.97 | 55.92 | 63.91 | 79.20   | 72.71   | 93.56 | 49.37 | 87.19  | 74.95
EMSViT     | None      | 61.50 | 39.61 | 44.38 | 39.59 | 67.46   | 62.94   | 89.21 | 43.14 | 75.45  | 69.78
EMSViT     | CUP       | 67.86 | 36.11 | 70.19 | 45.10 | 74.70   | 67.40   | 91.32 | 42.00 | 81.75  | 70.44
R50-EMSViT | CUP       | 71.29 | 32.87 | 73.73 | 55.13 | 75.80   | 72.20   | 91.51 | 45.99 | 81.99  | 73.95
TransUNet  | TransUNet | 77.48 | 31.69 | 87.23 | 63.13 | 81.87   | 77.02   | 94.08 | 55.86 | 85.08  | 75.62
SwinUnet   | SwinUnet  | 79.13 | 21.55 | 85.47 | 66.53 | 83.28   | 79.61   | 94.29 | 56.58 | 90.66  | 76.60
EMSViT     | EMSViT    | 80.45 | 21.24 | 86.41 | 66.80 | 83.59   | 80.12   | 94.56 | 56.90 | 91.28  | 76.82


We conduct five-fold cross-validation on the BraTS 2019 training
set. The quantitative results are presented in Table 2. Our network again
outperforms previous state-of-the-art CNN as well as transformer networks on
all evaluation metrics except the Hausdorff distance on TC.

We compare the performance of our model against CNN-based networks for
the task of brain tumour segmentation in Table 3. Again, our network
outperforms previous state-of-the-art CNN as well as transformer networks.

Table 2. Comparison on the BraTS 2019 validation set. DS represents Dice score and
HD represents Hausdorff distance. The best results are highlighted in bold.


Method           | ET (DS%) | WT (DS%) | TC (DS%) | ET (HD mm) | WT (HD mm) | TC (HD mm)
3D U-Net         | 70.86    | 87.38    | 72.48    | 5.062      | 9.432      | 8.719
V-Net            | 73.89    | 88.73    | 76.56    | 6.131      | 6.256      | 8.705
KiU-Net          | 73.21    | 87.60    | 73.92    | 6.323      | 8.942      | 9.893
Attention U-Net  | 75.96    | 88.81    | 77.20    | 5.202      | 7.756      | 8.258
Li et al.        | 77.10    | 88.60    | 81.30    | 6.033      | 6.232      | 7.409
TransBTS w/o TTA | 78.36    | 88.89    | 81.41    | 5.908      | 7.599      | 7.584
TransBTS w/ TTA  | 78.93    | 90.00    | 81.94    | 3.736      | 5.644      | 6.049
EMSViT           | 79.24    | 90.28    | 82.23    | 3.706      | 5.621      | 7.129


Table 3. Cross validation results of brain tumour segmentation task. DSC1, DSC2 
and DSC3 denote average dice scores for the Whole Tumour (WT), Enhancing Tumour 
(ET) and Tumour Core (TC) across all folds. For each split, average dice score of three 
classes are used. The best results are highlighted in bold. 


Method | Split-1 | Split-2 | Split-3 | Split-4 | Split-5 | DSC1 | DSC2 | DSC3 | Avg.
VNet | 64.83 | 67.28 | 65.23 | 65.2 | 66.34 | 75.96 | 54.99 | 66.38 | 65.77
AHNet | 65.78 | 69.31 | 65.16 | 65.05 | 67.84 | 75.8 | 57.58 | 66.50 | 66.63
Att-UNet | 66.39 | 70.18 | 65.39 | 66.11 | 67.29 | 75.29 | 57.11 | 68.81 | 67.07
UNet | 67.20 | 69.11 | 66.84 | 66.95 | 68.16 | 75.03 | 57.87 | 70.06 | 67.65
SegResNet | 69.62 | 71.84 | 67.86 | 68.52 | 70.43 | 76.37 | 59.56 | 73.03 | 69.65
EMSViT | 70.92 | 73.84 | 71.05 | 72.29 | 72.43 | 79.52 | 60.90 | 76.11 | 71.98


In Table 4, we compare the performance of our network against the previous
state of the art for the task of spleen segmentation. Except on Split-4 and
Split-5, our network outperforms both state-of-the-art CNN and transformer
networks.

The visualization of the validation set prediction is illustrated in Fig. 3: 


Fig. 3. All four modalities of the brain tumor visualized with the ground-truth and 
predicted segmentation of tumor sub-regions for the BraTS 2019 cross-validation dataset. 
Red label: Necrosis, yellow label: Edema and green label: Edema. 
Table 4. Cross validation results of spleen segmentation task. For each split, we provide 
the average dice score of the foreground class. The best results are highlighted in bold. 


Method | Split-1 | Split-2 | Split-3 | Split-4 | Split-5 | Avg.
VNet | 94.78 | 92.08 | 95.54 | 94.73 | 95.03 | 94.43
AHNet | 94.23 | 92.10 | 94.56 | 94.39 | 94.11 | 93.87
Att-UNet | 93.16 | 92.59 | 95.08 | 94.75 | 95.81 | 94.27
UNet | 92.83 | 92.83 | 95.76 | 95.01 | 96.27 | 94.54
SegResNet | 95.66 | 92.00 | 95.79 | 94.19 | 95.53 | 94.63
UNETR | 95.95 | 94.01 | 96.37 | 95.89 | 96.91 | 95.82
EMSViT | 96.14 | 94.52 | 96.52 | 95.76 | 96.78 | 96.14


The segmentation results of our model on the Synapse multi-organ CT 
dataset are shown in Fig. 4: 


Legend: aorta, gallbladder, left kidney, right kidney, liver, pancreas, spleen, stomach 


Fig. 4. The segmentation results of our network on the Synapse multi-organ CT 
dataset. Left depicts ground truth, while the right one depicts predicted segmenta- 
tion from our network. 


4.1 Ablation Studies 


We evaluate our model with bilinear interpolation and transposed convolution
as the up-sampling operator on the Synapse multi-organ CT dataset, as shown
in Table 5. The experiment shows that our network achieves better
segmentation accuracy when using transposed convolution layers.


Table 5. Ablation study on the impact of the up-sampling. Here BI denotes bilinear 
interpolation, TC denotes transposed convolution. The best results are highlighted in 
bold. 


Up-sampling | DSC | Aorta | Gallbladder | Kidney (L) | Kidney (R) | Liver | Pancreas | Spleen | Stomach

82.04 | 67.18 
TC | 78.53 | 84.55 | 68.02 | 82.46 | 74.41 | 94.59 | 55.91 | 89.25 | 73.96
We explore our network at various model scales (i.e. depth (L) and embedding 
dimension (d)) using the BraTS 2019 validation dataset. This ablation study
verifies the impact of the Transformer scale on segmentation performance. Our 
network with d = 384 and L = 4 achieves the best scores on ET, WT and 
TC. Increasing the depth and decreasing the embedding dimension both give
better results. However, the impact of depth on performance is much larger
than that of the embedding dimension, as shown in Table 6: 


Table 6. Ablation study demonstrating the effect of depth and embedding dimension 
on our vision transformer using BraTS 2019 validation dataset. DS represents Dice 
score. The best results are highlighted in bold. 


Depth (L) | Embedding dim (d) | ET (DS%) | WT (DS%) | TC (DS%)
1 | 384 | 69.24 | 84.16 | 70.18
1 | 512 | 69.05 | 83.87 | 69.92
2 | 384 | 70.59 | 84.88 | 72.51
2 | 512 | 70.13 | 84.15 | 71.99
4 | 384 | 72.06 | 85.39 | 73.67
4 | 512 | 71.55 | 85.06 | 73.05


Using the set of ablation studies, it can be inferred that the performance of 
our network is generalizable. 


5 Conclusions 


Biomedical image segmentation is a challenging problem in medical imaging. 
Recently, deep learning methods leveraging both CNN and transformer-based 
architectures have been highly successful in this domain. In this paper, we
propose a novel network named Efficient Multi Scale Vision Transformer (EMSViT) 
for biomedical image segmentation. We use a multi-scale mechanism that splits
the features, applies different convolutions, and concatenates the resulting
feature maps before they are passed to the transformer blocks in the encoder. 
The decoder uses a similar mechanism, with skip connections linking the 
encoder and decoder transformer blocks. The output feature map after the
split-and-concat operator is passed through a linear projection block to
produce the output segmentation map. Measured by Dice score and Hausdorff
distance on multiple datasets, our network outperforms most previous CNN and
transformer-based architectures. In the future, we would like to use the
Efficient Multi Scale Vision Transformer to tackle other problems in computer
vision, such as depth estimation. 
Abstract. Deep neural network methods have led to impressive
breakthroughs in the medical image field. Most of them focus on single-modal 
data, while diagnoses in clinical practice are usually determined based 
on multi-modal data, especially for tumor diseases. In this paper, we 
intend to find a way to effectively fuse radiology images and pathology 
images for the diagnosis of gliomas. To this end, we propose a collaborative
attention network (CA-Net), which consists of three attention-based 
feature fusion modules: multi-instance attention, cross attention, and 
attention fusion. We first use an individual network for each modality
to extract the original features. Multi-instance attention combines 
different informative patches in the pathology image to form a holistic 
pathology feature. Cross attention interacts between the two modalities 
and enhances single-modality features by exploring complementary
information from the other modality. The cross attention matrices indicate the 
feature reliability, so they are further used to obtain a coefficient for 
each modality to linearly fuse the enhanced features as the final
representation in the attention fusion module. The three attention modules
collaborate to discover a comprehensive representation. Our results on 
the CPM-RadPath dataset outperform other fusion methods by a large margin, 
which demonstrates the effectiveness of the proposed method. 


Keywords: Multi-modal · Cross attention · Gliomas 


1 Introduction 


Gliomas are the most common primary intracranial tumors, accounting for 40% 
to 50% of all cranial tumors. The World Health Organization (WHO) grading
system grades gliomas from 1 (least malignant and best prognosis) to 4 (most 
malignant and worst prognosis). According to the pathological malignancy of 
the tumor cells, brain gliomas are also divided into low-grade gliomas (including 
astrocytoma and oligodendroglioma) and high-grade gliomas (glioblastoma).
Magnetic resonance imaging (MRI) is the common examination method for gliomas, 
and is mainly used to distinguish low-grade gliomas from high-grade gliomas. 
Due to the limitation of MRI in distinguishing astrocytoma from
oligodendroglioma, pathology images are also used. Hence, the diagnosis of gliomas 
in clinical practice is based on multiple modalities of medical images, which 
requires doctors to have extensive experience. Computer aided diagnosis (CAD) 
systems are in demand to facilitate the diagnosis process. 

Convolutional neural network (CNN) is the most widely used deep learning 
model to learn complex discriminative features of images and various architec- 
tures of CNN have been proposed, such as VGG16 [1], ResNet [2], and Densenet 
[3]. These networks achieve human-level performance on many tasks in the nat- 
ural image field. Moreover, deep learning methods also bring significant progress 
in the medical field. For instance, the U-Net [4] architecture was proposed for the 
segmentation of neuronal structures and performed well on a variety of biomed- 
ical segmentation tasks. However, most models only focus on single modality 
data, such as X-ray images [5], CT images [6], or MRI images [7]. 

In order to obtain more information for better decisions, learning from 
multi-modal data has been a growing trend. Incorporating visual information 
into speech tasks has achieved great gains, for example in speech enhancement 
[8] and speech separation [9,10]. Pretraining on vision and language data quickly 
became a popular task after the advent of BERT [11]. In the medical image field, 
multi-modal data refers to the images taken by different inspection methods and 
non-image data [36]. Although there are some public multi-modal datasets such as 
BraTS [12,37-39], CHAOS [13], and CPM-RadPath [14,40], methods for fusing 
multi-modal data are still lacking. To the best of our knowledge, most fusion 
methods on medical images are limited to direct fusion by concatenation or linear 
weighting at the input level [15-17], feature level [18-20,28], or decision level 
[21-23]. Pandya et al. [24] introduced a multi-channel MRI embedding strategy 
to improve the result of deep learning-based tumor segmentation models. This 
method linearly fused four modalities at the input-level. Neubauer et al. [18] 
improved the performance of tumor delineation by merging the features of MRI 
and PET/CT data after two modality-specific encoders. Kamnitsas et al. [22] 
trained three networks separately and averaged the confidence of each network 
as the final result. 

MRI and pathology imaging are the most common inspection methods
for glioma diagnosis. CPM-RadPath [14,40] provided both modalities to 
evaluate the performance of computer-aided systems. This task is difficult as the 
two modalities are totally different. MRI images are 3D scanning data of the 
brain, while pathology images are 2D microscopy data of the sliced tissue. Ma 
et al. [25] fused the final results of the two modalities by logistic regression. Xue 
et al. [26] proposed a dual-path model and directly fused the features before the
last fully connected layer. However, due to the great difference between the 
two modalities, the relation between them is quite complicated and cannot 
be captured by these simple fusion methods. In this work, we adopt the
powerful modeling capability of the attention mechanism and propose a collaborative 
attention network (CA-Net). It consists of three attention-based feature fusion 
modules. Multi-instance attention combines different pathology patch features. 
Fig. 1. The pipeline of the proposed framework. Features from the pathology image 
and the MRI image are fused by three modules, Multi-Instance Attention (MIA), Cross 
Attention (CA), Attention Fusion (AF) to identify three subtypes of gliomas. 


Cross attention implicitly captures the relation between the two modalities and 
enhances both features with complementary information from the other
modality. Attention fusion fuses the two features according to the reliability of each 
feature, which is computed based on the learned cross attention matrices, and 
obtains the final feature representation. 


2 Method 


Based on pathology images and MRI images, our task is to identify the subtypes 
of gliomas. The pipeline of the proposed CA-Net is shown in Fig. 1. It includes
five parts: two feature extractors, one for pathological images and one for MRI
images, and three collaborative attention-based feature fusion modules, i.e.
Multi-Instance Attention (MIA), Cross Attention (CA), and Attention Fusion (AF). 


2.1 Feature Extraction 


The resolution of pathological images is around 100000 × 100000, which is too 
large for computation devices to process. A typical solution is extracting patches 
from the whole slide image. We exclude the white background regions and crop 
patches sized 256 × 256 without overlap. Then we filter out the patches that 
have low entropy. The extracted patches are then fed to a DenseNet [3] structure 
network which consists of four stages; the number of dense blocks in each 
stage is 4, 8, 12, and 24. 
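
The patch selection described above (non-overlapping tiling, background removal, entropy filtering) can be sketched as follows. This is only an illustration on a tiny grey-level image: the function names, the use of mean intensity to detect white background, and the threshold values `white_level` and `min_entropy` are assumptions, since the paper does not state how these steps were implemented.

```python
import math
from collections import Counter

def shannon_entropy(pixels):
    """Shannon entropy (in bits) of a flat list of integer intensities."""
    counts = Counter(pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def extract_patches(image, size=256, white_level=230, min_entropy=4.0):
    """Tile a 2D list of 0-255 grey values without overlap and keep informative tiles.

    `white_level` and `min_entropy` are illustrative thresholds, not values
    from the paper. Returns the (top, left) corner of every kept patch.
    """
    kept = []
    rows, cols = len(image), len(image[0])
    for top in range(0, rows - size + 1, size):
        for left in range(0, cols - size + 1, size):
            pixels = [image[r][c]
                      for r in range(top, top + size)
                      for c in range(left, left + size)]
            if sum(pixels) / len(pixels) >= white_level:
                continue  # mostly white background
            if shannon_entropy(pixels) < min_entropy:
                continue  # nearly uniform patch with little texture
            kept.append((top, left))
    return kept

# Toy 4 x 4 "slide" split into 2 x 2 patches (hypothetical data).
image = [
    [255, 255, 100, 100],
    [255, 255, 100, 100],
    [10, 80, 0, 0],
    [150, 220, 255, 255],
]
print(extract_patches(image, size=2, white_level=230, min_entropy=1.5))
```

In the toy image, the top-left tile is discarded as background and two tiles are discarded for low entropy, so only the textured bottom-left tile survives.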

The MRI images of each patient contain four types of scans: T1, 
T2, T1-CE, and Flair. To discard irrelevant information, the lesion is first
extracted by a U-Net-structured lesion segmentation model 
with 23 layers, which is pre-trained on BraTS2019 [12,37-39]. Lesion regions 
are then cropped and resized to 128 × 128 × 128. The four types of scans are 
concatenated to form a 4D tensor. The feature extractor is a 3D-DenseNet [3], 
which consists of four stages; the number of dense blocks in each stage is 4, 
8, 12, and 12. 

Fig. 2. The architecture of the Multi-instance Attention module (MIA). Features from 
different patches are fused by adaptively learned coefficients to form a holistic feature. 

Both the pathology image and MRI image feature extractors are trained 
with a cross-entropy loss. Since the pathological images are only annotated with 
image labels, we have no label for each patch. Thereby, we directly assign the 
whole image label to the sampled patches, as most studies [27] do. 


2.2 Multi-instance Attention 


There are multiple patches and multiple features in each pathology image, which 
is unbalanced when fusing with the radiology feature. So we should combine the 
features of all the patches to form a holistic feature, which is similar to the 
setting in multi-instance learning (MIL). The extracted patch is regarded as an 
instance and we shall build a bag feature to represent the pathology image. To 
this end, we propose a multi-instance attention module, as illustrated in Fig. 2. 

For the convenience of parallel training, we only sample a fixed number (500 
in this paper) of instances for training and inference. All the sampled instances 
with a feature size of c × 8 × 8 are sent to a global average pooling (GAP) layer, 
resulting in a feature size of c × 1, where c is the channel number. Then the
attention coefficient is computed by Eq. 1. 


$$a_j = \frac{\exp\left(w^{\top}\tanh(v g_j)\right)}{\sum_{k=1}^{M}\exp\left(w^{\top}\tanh(v g_k)\right)} \qquad (1)$$

$g_j$ is the feature of the $j$th instance after GAP. $M$ is the number of instances. 
$w \in \mathbb{R}^{c \times 1}$ and $v \in \mathbb{R}^{c \times c}$ are the parameters of two fully connected layers. Tanh 
is employed as the activation function. The learned attention coefficients are 
further utilized to accumulate all the instances’ features and get the bag-level 
feature. 
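
As an illustration of Eq. 1, the sketch below computes the attention coefficients and the bag-level feature in plain Python. It assumes the two fully connected layers have no bias terms and that the bag feature is the coefficient-weighted sum of the instance features; the function names and toy weights are hypothetical and do not come from the authors' code.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def mia_pool(instances, v, w):
    """Attention-weighted bag feature from instance features (Eq. 1).

    instances: M feature vectors of length c (after global average pooling)
    v: c x c weight matrix, w: weight vector of length c (biases omitted)
    """
    scores = []
    for g in instances:
        hidden = [math.tanh(h) for h in matvec(v, g)]
        scores.append(sum(wi * hi for wi, hi in zip(w, hidden)))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    coeffs = [e / total for e in exps]
    c = len(instances[0])
    bag = [sum(a * g[d] for a, g in zip(coeffs, instances)) for d in range(c)]
    return coeffs, bag

# Toy example with M = 3 instances and c = 2 channels (hypothetical values).
instances = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
w = [1.0, -1.0]
coeffs, bag = mia_pool(instances, v, w)
print(coeffs, bag)
```

With all-zero `w`, every instance receives the same coefficient and the bag feature reduces to the plain average of the instances, which makes the role of the learned scores easy to see.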


2.3 Cross Attention 


Pathology features and radiology features have plenty of complementary
information. Previous feature fusion methods, including concatenation and linear 
fusion, cannot effectively explore the relation between the two modalities. In 
this work, we propose a cross-attention module to deeply learn their relations, 
which is illustrated in Fig. 3. 

Fig. 3. The architecture of the cross-attention module. 

Attention is a popular mechanism in deep learning models, especially after 
the introduction of self-attention [29]. The most frequently used attention is 
scaled dot-product attention, which computes the relation by the dot product of 
the feature vectors. Dot-product attention implies that similar features are
closely related. In our task, however, the features come from two totally different 
modalities, so dot-product attention is not appropriate. 
We instead adopt additive attention [30] to explore the relationship between the 
modalities, which is formulated as follows: 


$$e_{ij} = f(q_i, k_j), \qquad (2)$$

$$a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{N}\exp(e_{ik})}, \qquad (3)$$

$$g_i = \sum_{j=1}^{N} a_{ij} k_j. \qquad (4)$$


The pathology feature size is c × 8 × 8 and the radiology feature size is 
c × 4 × 4 × 4. Both of them are reshaped to c × 64 before being sent to the
attention module, where c is the channel number, i.e. the feature length. Attention
is computed at every position. $q_i$ is the query feature from one modality and
$k_j$ is the key feature from the other modality. $N$ is the number of positions
(64 in our setting). A shared multi-layer perceptron (MLP) followed by a softmax
normalization is employed to learn their relation. Note that $q_i$ and $k_j$ are
concatenated before being sent to the MLP, which means $e_{ij}$ will be different
when the modality of the query feature changes. Then the complementary feature
from the other modality can be obtained by a simple linearly weighted summation.
The complementary feature $g_i$ is added to the original query feature $q_i$ to
enhance the feature of each modality, obtaining $F_p$ and $F_r$. 
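
A minimal sketch of Eqs. 2 to 4 is given below. It assumes the shared MLP f has a single tanh hidden layer without biases and that each modality is supplied as a list of N position vectors of length c; the layer sizes, weights, and function names are hypothetical, as the paper does not specify the MLP architecture.

```python
import math

def additive_score(q, k, w1, w2):
    """e_ij = f(q_i, k_j): one-hidden-layer MLP on the concatenation [q_i; k_j]."""
    x = q + k  # list concatenation of query and key
    hidden = [math.tanh(sum(a * b for a, b in zip(row, x))) for row in w1]
    return sum(a * b for a, b in zip(w2, hidden))

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, w1, w2):
    """Return the enhanced query features and the raw scores e (Eqs. 2-4)."""
    scores, enhanced = [], []
    for q in queries:
        e_i = [additive_score(q, k, w1, w2) for k in keys]   # Eq. 2
        a_i = softmax(e_i)                                    # Eq. 3
        g_i = [sum(a * k[d] for a, k in zip(a_i, keys))       # Eq. 4
               for d in range(len(q))]
        enhanced.append([qd + gd for qd, gd in zip(q, g_i)])  # add to the query
        scores.append(e_i)
    return enhanced, scores

# Toy features: N = 2 positions, c = 2 channels (hypothetical values).
patho = [[1.0, 0.0], [0.0, 1.0]]
radio = [[2.0, 2.0], [4.0, 0.0]]
w1 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.5, -0.1, 0.2], [0.3, 0.1, 0.0, -0.4]]
w2 = [0.7, -0.3, 0.5]

F_p, e_p = cross_attention(patho, radio, w1, w2)  # pathology as query
F_r, e_r = cross_attention(radio, patho, w1, w2)  # radiology as query
print(F_p, F_r)
```

The same weights `w1` and `w2` are used in both directions, mirroring the shared MLP; the scores differ between the two calls only because the order of query and key in the concatenation changes.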
2.4 Attention Fusion 


The last step is to fuse the features from the two modalities. Although the 
enhanced feature of each modality already contains information from both
modalities, we believe that their representational abilities, i.e. reliabilities, are 
still different. An easy solution is to learn an adaptive linear coefficient for each 
modality. But this introduces extra parameters, which may lead to overfitting. 
We notice that the attention matrix in the cross-attention module reflects the 
relation between the two modalities. Therefore, we attempt to estimate the
reliability from the attention matrix. When $e_{ij}$ in Eq. 2 is larger, 
the query feature $q_i$ is more dependent on the key feature $k_j$, implying that 
the query feature is less reliable. Although the query feature is enhanced by the 
cross attention module, the complementary feature is scaled by a normalized 
coefficient $a_{ij}$ for the sake of stable training. Hence, the enhanced feature still 
does not contain sufficient complementary information. Thus we can infer the 
feature reliability according to $e_{ij}$. We compute the reliability as in Eq. 5. 


$$r = \frac{1}{\sum_{i=1}^{N}\sum_{j=1}^{N} g(e_{ij})} \qquad (5)$$

Here $g(\cdot)$ is a measure function, which is the sigmoid in this work. The
final feature representation is obtained by Eq. 6. 


$$F = \frac{r_p F_p + r_r F_r}{r_p + r_r} \qquad (6)$$

$F_p$ and $F_r$ are the enhanced pathology feature and radiology feature. $r_p$ and 
$r_r$ are the corresponding reliabilities calculated by Eq. 5 when taking pathology 
features and radiology features as the query feature, respectively. The higher the 
reliability is, the higher the weight is. 
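
Eqs. 5 and 6 can be sketched in a few lines. The example assumes the raw scores are stored as N × N nested lists, one matrix per query modality, and that the enhanced features are lists of position vectors; the helper names and toy values are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reliability(scores):
    """Eq. 5: inverse of the summed sigmoid of all raw scores e_ij."""
    return 1.0 / sum(sigmoid(e) for row in scores for e in row)

def attention_fusion(f_p, f_r, e_p, e_r):
    """Eq. 6: reliability-weighted average of the two enhanced features."""
    r_p, r_r = reliability(e_p), reliability(e_r)
    return [[(r_p * a + r_r * b) / (r_p + r_r) for a, b in zip(row_p, row_r)]
            for row_p, row_r in zip(f_p, f_r)]

# Toy enhanced features and scores (hypothetical values).
f_p = [[1.0, 1.0]]
f_r = [[0.0, 0.0]]
e_low = [[0.0]]   # query depends little on the other modality
e_high = [[5.0]]  # query depends strongly on the other modality

balanced = attention_fusion(f_p, f_r, e_low, e_low)
skewed = attention_fusion(f_p, f_r, e_low, e_high)
print(balanced, skewed)
```

When both score matrices are equal, the fusion is a plain average; when the radiology query has larger scores, its reliability drops and the fused feature moves towards the pathology feature. No trainable parameters are involved, in line with the parameter-free design described above.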

The final feature representation is sent to the classifier to be classified into 
three subtypes of gliomas. The loss function is cross entropy. The three attention 
based feature modules are jointly trained, while the feature extractors of the two 
modalities are trained independently. 


3 Results 


3.1 Experiment Setup 


Dataset. CPM-RadPath [14,40] consists of 221 paired radiology images and 
histopathology images for training. Since we cannot obtain the validation data 
and test data, we only utilized its training data for experiments. Due to the 
limited number of images in medical tasks, all the experiments were evaluated 
by 3-fold cross-validation. The MRI images of each patient contain four types 
of scans: Flair, T1, T1-Ce, and T2. Because differences in the staining process 
of slices cause a large variance in color among pathology images, we converted
the RGB pathology images into gray images. CPM-RadPath aims to distinguish between 
three subtypes of brain tumors, namely astrocytoma, oligodendroglioma, and 
glioblastoma. The number of each subtype is shown in Table 1. 
Table 1. Data distribution of different subtypes in CPM-RadPath. 

Subtype | A | O | G | Total
Number | 54 | 34 | 133 | 221
A: astrocytoma, O: oligodendroglioma, G: glioblastoma 


Implementation Details. Feature extractors of pathology images and radiology
images were trained with a batch size of 400 and 20 respectively, and the 
number of feature channels was set to 64. Xavier initialization was adopted in 
all the models. Parameters were optimized by SGD [31], and the weight decay 
and momentum were set to 1e-4 and 0.95 respectively. The learning rate was 
initially set to 0.001 and was divided by 10 at 50% and 75% of the total training 
epochs. All the models were trained with MXNet [32] for 200 epochs on 
a Tesla V100 GPU. For the pathology images, the same augmentation methods 
as in [35] were used, including random brightness and contrast, random 
saturation and hue, flip, and rotation. Random crop and flip were adopted as 
data augmentation for the radiology images. 
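
The step learning-rate schedule described above can be written as a small function. This is a sketch rather than the authors' training code; in particular, whether the drop takes effect exactly at the boundary epoch is an assumption.

```python
def learning_rate(epoch, total_epochs=200, base_lr=0.001):
    """Step schedule: divide the rate by 10 at 50% and at 75% of training.

    Assumes the drop applies from the boundary epoch onwards (0-based epochs).
    """
    lr = base_lr
    if epoch >= total_epochs * 0.5:
        lr /= 10
    if epoch >= total_epochs * 0.75:
        lr /= 10
    return lr

for epoch in (0, 99, 100, 150, 199):
    print(epoch, learning_rate(epoch))
```

With 200 epochs, the rate stays at 0.001 for the first half, drops to one tenth of that at epoch 100, and to one hundredth at epoch 150.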

The feature extractors of the two modalities were first trained with a
cross-entropy loss. Then we froze the feature extractors and jointly trained the three 
attention modules. 


3.2 Results of Gliomas Classification 


The same evaluation metrics of the CPM-RadPath challenge [14,40] were 
employed to evaluate the effectiveness of the proposed method in this paper. 
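
The result tables in this section report three columns: balanced accuracy, micro-averaged F1, and Cohen's kappa. The sketch below shows standard definitions of these metrics for single-label classification; it is not the challenge's official scoring code, and the labels in the example are made up.

```python
from collections import Counter

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall values."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def f1_micro(y_true, y_pred):
    """Micro-averaged F1; equals accuracy for single-label multi-class tasks."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Agreement between labels and predictions, corrected for chance."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    expected = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels for the three subtypes (A, O, G).
y_true = ["A", "A", "O", "O", "G", "G", "G", "G"]
y_pred = ["A", "O", "O", "O", "G", "G", "G", "A"]
print(balanced_accuracy(y_true, y_pred))
print(f1_micro(y_true, y_pred))
print(cohen_kappa(y_true, y_pred))
```

Balanced accuracy weights every subtype equally, which matters here because the class sizes in the dataset are uneven.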


Results on a Single Modality. The dataset consists of pathology images and 
radiology images (MRI). We first evaluated the performance on single modality 
data. Results are displayed in Table 2. Compared with the pathology image, the 
results of the radiology image are much worse. The reason is that astrocytoma 
and oligodendroglioma only have a slight difference in radiology images, so it is 
difficult for models to learn a discriminative feature. And that is also why we 
need pathology images in this task. 

When evaluating on the pathology images, we compared our multi-instance 
attention with another common feature fusion method, max-out [33]. Max-out 
selects, for each feature element, the largest value among all the extracted
patches as the output. We do not use concatenation because the number of
patches (500) is too large: the concatenated feature would be very long and
hard to fuse with the radiology feature. Compared with max-out, our
multi-instance attention achieved higher performance, indicating that different
patches have different importance and that our attention mechanism can
effectively incorporate all the patches. 
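
For comparison with the attention-based pooling, the max-out baseline described above amounts to an element-wise maximum over the patch features. The sketch below assumes each patch feature is a list of equal length; the values are made up.

```python
def max_out(instances):
    """Element-wise maximum over all patch features."""
    return [max(column) for column in zip(*instances)]

# Three hypothetical patch features with two elements each.
patch_features = [[0.2, 0.9], [0.7, 0.1], [0.4, 0.4]]
print(max_out(patch_features))
```

Each output element comes from a single patch, so the remaining patches do not contribute to that element, whereas attention pooling forms a weighted combination of all of them.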
Table 2. Results on a single modality. 

Data | Balanced-acc | F1-micro | Kappa
Radio | 0.722 | 0.818 | 0.683
Patho (Max-out) | 0.877 | 0.917 | 0.852
Patho (MIA) | 0.887 | 0.925 | 0.866

Results on Multiple Modalities. Then we evaluated our methods on the 
multiple modality data. Since the training of feature extraction and feature fusion 
are independent, we directly used the output feature of the single modality model 
as the input feature of the fusion stage. Particularly, the pathology feature refers 
to the feature obtained by our proposed multi-instance attention. We compared 
our methods with other feature fusion methods and the results are displayed in 
Table 3. Simply concatenating the features is treated as the baseline. Xue et al. 
[26] fused the two features by a learned linear weight, while Ma et al. [25] fused 
the scores of each modality by logistic regression. We reimplemented them on 
the proposed framework. 


Table 3. Comparison of different methods on multi-modal data. 


Method | Balanced-acc | F1-micro | Kappa
Concat | 0.866 | 0.917 | 0.851
Linear Feature Fusion | 0.886 | 0.932 | 0.878
Linear Score Fusion | 0.886 | 0.933 | 0.876
Ours w/o Attention Fusion | 0.891 | 0.940 | 0.892
Ours | 0.912 | 0.948 | 0.906


As pathology features and radiology features focus on different
characteristics of gliomas, simple concatenation cannot capture the relation between the 
two modalities. So when we concatenated pathology features and radiology
features, the results were even worse than those of the pathology feature alone. 
Linear feature fusion and score fusion introduce extra parameters to capture the 
relation between the two modalities, so they improved on both single-modality
results in F1-micro and Kappa, with comparable balanced accuracy. The results
show that the two modalities are complementary and can benefit from each other. 

The linear fusion method is a simple linear combination of two features and 
there is no interaction between the two modalities. So we propose the cross 
attention module to interact between the two modalities and intend to enhance 
single modality features by digging complementary information from the other 
modality. The enhanced features are further fused by two linear weights which 
are derived from the attention matrix, i.e. attention fusion. As Table 3 shows, our 
results outperform the other methods on all three metrics, for example with a
balanced accuracy of 0.912 versus 0.886 for the linear fusion baselines. We also
conducted an ablation experiment that replaced the attention fusion module with
a concatenation operation. The performance is still higher than that of the other
methods, which further demonstrates that the cross-attention module can explore
complementary information from the two modalities and form a comprehensive
feature representation. 


4 Conclusion 


In this paper, we propose a collaborative attention network to utilize multi- 
ple modality data for the diagnosis of gliomas. The network consists of three 
attention-based feature fusion modules. The multi-instance attention combines 
different patch features from the pathology images to construct a holistic pathol- 
ogy feature. Then the pathology feature and radiology feature are fused by the 
cross attention module. The final feature representation is obtained by the atten- 
tion fusion module. Experimental results on CPM-RadPath demonstrate the 
effectiveness of the proposed method. 

The proposed attention fusion module estimates the reliability of different 
features according to their cross-attention matrices. No additional parameters 
are introduced in this module and it can be implemented with one line of code. 
Therefore, it can serve as a plug-and-play module and be used in other
multi-feature fusion tasks. 
Abstract. In this work, we tackle the problem of Semi-Supervised 
Anomaly Segmentation (SAS) in Magnetic Resonance Images (MRI) 
of the brain, which is the task of automatically identifying patholo- 
gies in brain images. Our work challenges the effectiveness of current 
Machine Learning (ML) approaches in this application domain by show- 
ing that thresholding Fluid-attenuated inversion recovery (FLAIR) MR 
scans provides better anomaly segmentation maps than several different 
ML-based anomaly detection models. Specifically, our method achieves 
better Dice similarity coefficients and Precision-Recall curves than the 
competitors on various popular evaluation data sets for the segmentation 
of tumors and multiple sclerosis lesions. (Code available under:
https://github.com/FeliMe/brain_sas_baseline) 


Keywords: Semi-supervised Anomaly Segmentation · Anomaly 
detection · Brain MRI 


1 Introduction 


The medical imaging domain is characterized by large amounts of data, but 
their usability for machine learning is limited due to the challenges in shar- 
ing the data and the difficulties in obtaining labels, which requires annotations 
by expert radiologists and is time-consuming and costly. Especially pixel- or 
voxel-wise segmentation of different diseases in medical images is a tedious task. 
Semi-supervised machine learning seems like a natural fit to gain insights into 
the analysis of medical images for diagnosis as it requires no annotations and 
can easily utilize the large amounts of data available. Especially valuable in 
this domain is Semi-Supervised Anomaly Segmentation (SAS). Here, unlabelled 
imaging data is used to build a system that can automatically detect anything 
that deviates from the “norm” when presented with unseen data. In medical 
images, this technique is particularly helpful as anomalies here often indicate 
morphological manifestations of pathology. 

Recently, SAS achieved impressive successes in automatic industrial defect 
detection [9,13,17,25] on the MVTec-AD data set [8]. In the medical imaging 
domain, most works have focused on the detection of pathologies in brain images. 
Here, mostly autoencoder-based approaches have been applied so far [1,4,5,11,12,
26,27]. These techniques use only images from healthy subjects as training data to 
learn the distribution of “normal” brain anatomies. During inference, most of the 
approaches compute a so-called anomaly map as the pixel-wise residual between 
the input image and a predicted “normal” version of the same image generated 
by the model, that is closer to the training distribution. Common anomaly types 
in brain MRI are tumors and lesions from specific diseases such as multiple scle- 
rosis (MS). In fact, all of the aforementioned works evaluate their performance 
by detecting either of them or both. In clinical routine, MR images are typically 
acquired using different sequences or weightings in which the tissues appear in 
specific intensities. Among the most common ones are T1, T2, Fluid-attenuated 
inversion recovery (FLAIR), or Proton density (PD)-weighting. In FLAIR images,
a standard protocol for routine clinical imaging in neurology, lesions are
hyper-intense compared to the rest of the tissue and tumors are usually brighter
as well. Because of this, FLAIR images are often used in SAS of brain MRI [1,4,5,20]. 

In our work, we leverage this prior knowledge to build a baseline that per- 
forms anomaly segmentation of brain MRI via simple thresholding of the input 
FLAIR image. In particular, the main contributions of our work are: 


— We show that learning the distribution of “normal” anatomies in FLAIR 
images using existing autoencoder-based approaches does not provide better 
segmentation maps of common anomalies in the brain than the input images 
themselves binarized at a certain threshold intensity. 

— We provide a simple baseline that requires no learning and outperforms most 
state-of-the-art SAS methods on common evaluation data sets containing 
brain tumors and MS lesions. 
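
The idea behind the baseline can be sketched as follows: binarize the FLAIR intensities inside the brain at a fixed threshold and score the result with the Dice coefficient. The sketch assumes intensities already normalised to [0, 1], a given brain mask, and an arbitrary threshold of 0.7; it leaves out any pre- and post-processing, and all names and values are hypothetical rather than taken from the published code.

```python
def threshold_segmentation(flair, brain_mask, threshold):
    """Mark pixels inside the brain whose FLAIR intensity exceeds `threshold`."""
    return [[bool(m) and v > threshold for v, m in zip(row_v, row_m)]
            for row_v, row_m in zip(flair, brain_mask)]

def dice(pred, truth):
    """Dice coefficient of two binary 2D masks."""
    pairs = [(p, t) for row_p, row_t in zip(pred, truth)
             for p, t in zip(row_p, row_t)]
    overlap = sum(1 for p, t in pairs if p and t)
    size = sum(1 for p, _ in pairs if p) + sum(1 for _, t in pairs if t)
    return 2.0 * overlap / size if size else 1.0

# Toy 3 x 3 slice (hypothetical values).
flair = [[0.0, 0.2, 0.3],
         [0.2, 0.9, 0.8],
         [0.1, 0.3, 0.95]]
brain_mask = [[0, 1, 1],
              [1, 1, 1],
              [1, 1, 0]]
truth = [[0, 0, 0],
         [0, 1, 1],
         [0, 1, 0]]

pred = threshold_segmentation(flair, brain_mask, threshold=0.7)
print(dice(pred, truth))
```

The bright pixel in the bottom-right corner is ignored because it lies outside the brain mask, which illustrates why masking matters for an intensity-only method.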


2 Related Work 


Several methods for SAS in brain images have been introduced in recent years. 
Most of them are based on semi-supervised training of Autoencoders. The principle
is depicted in Fig. 1. The model is trained on images without anomalies 
only to learn a distribution of healthy brain images. During inference, the newly 
presented image is processed by the model to obtain a “healthy” version of the 
same image. Usually, an anomaly map is then obtained by computing the resid- 
ual between the input image and its “healthy” version. Pixels of the anomaly 
map above a threshold are then considered anomalous. 
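
The residual-based inference step described above can be illustrated as follows. The "healthy" reconstruction would normally be produced by the trained model; here it is a hand-written stand-in, and the use of the absolute difference as the residual as well as the threshold value are assumptions.

```python
def anomaly_map(image, reconstruction):
    """Pixel-wise absolute residual between the input and its 'healthy' version."""
    return [[abs(x - y) for x, y in zip(row_x, row_y)]
            for row_x, row_y in zip(image, reconstruction)]

def segment(residual, threshold):
    """Pixels whose residual exceeds the threshold are labelled anomalous."""
    return [[r > threshold for r in row] for row in residual]

# Toy 2 x 2 example; `reconstruction` stands in for the model output.
image = [[0.2, 0.2], [0.2, 0.9]]
reconstruction = [[0.2, 0.25], [0.2, 0.3]]

residual = anomaly_map(image, reconstruction)
print(segment(residual, threshold=0.5))
```

Only the bright pixel that the stand-in reconstruction fails to reproduce exceeds the threshold, which is the behaviour these methods rely on for lesions and tumors.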

In [19], the authors trained a Bayesian Autoencoder to perform anomaly 
segmentation on CT images. Chen and Konukoglu [11] built an Adversarial 
Autoencoder with an additional constraint forcing the input image and its recon- 
struction to be close in latent space. Another reconstruction-based technique was 
proposed in [5], where Baur et al. built a VAEGAN to increase reconstruction 
fidelity and realism of the reconstructed images. Zimmerer et al. [26,27] added 
gradient information from the loss-function of Variational Autoencoders (VAEs) 
to the reconstruction error, offering superior anomaly maps. 
Fig. 1. Overview of Autoencoder- and GAN-based SAS. During training, the model 
learns the distribution of normal anatomies using only images of healthy patients. 
At inference time, the model generates a “healthy” version of the input image. The 
anomalies can be determined from the residual image. Image adapted from [4]. 


Restoration methods use the trained model to perform gradient optimization 
on the input image to construct an image that is both similar to the input and 
close to the distribution of normal anatomies learned by the model. Anomaly 
maps are again computed as the residual between the input- and the optimized 
image. An early example of this technique was proposed by Schlegl et al. [22]. 
They retrieve the closest version to an image that a Generative Adversarial 
Network (GAN) — trained on images of healthy patients only — can produce. 
Chen et al. [12] also used restoration by maximizing the evidence lower bound 
(ELBO) of an image on a Gaussian Mixture VAE (GMVAE). 

Recently, Baur et al. published a comparative study [4], comparing all the 
methods above on the same data sets with a unified architecture. We use their 
results in this work to compare our baseline against all of these techniques. We 
use the same data sets for evaluation and use a similar pre- and post-processing 
pipeline. In [6], Baur et al. proposed to use a U-Net-like Autoencoder with skip- 
connections and in [7], the same authors introduced a multi-scale Autoencoder 
utilizing a Laplacian pyramid. While [6] and [7] were both trained on the same
data and used the same pre-processing as [4], only [7] was evaluated on one public
data set and can be compared in this work. Pinaya et al. [20] achieved impressive
results in SAS of brain MRI. They trained a Vector Quantised VAE (VQ-VAE)
on a large cohort of FLAIR images of healthy subjects and later trained an
ensemble of autoregressive Transformers in its latent space. The Transformers
provide an explicit probability distribution of pixels in the latent space. Pixels
with low posterior probability are considered anomalous. Since this method is 
not included in the comparative study by Baur et al. [4], we compare our results 
to theirs in a separate experiment. 

Lastly, anomaly detection was used by van Hespen et al. [23] to detect chronic
brain infarcts on MRI. They developed a patch-based detection approach using a
scoring function based on latent space distances instead of the reconstructed
image. The anomaly score for the whole image is calculated as a combination
of all patches, resulting in a coarse segmentation map. We did not include this
method in our experiments because the models were trained on data that is not
publicly available and the model parameters have not been released. However, they
showed that SAS methods are able to spot unseen anomalies. Their system was
able to identify anomalies missed in the annotation of an expert radiologist,
demonstrating the usefulness of such approaches.


3 Experiments 


In the following, we present the data sets we used to evaluate our baseline, pre- 
and post-processing steps and evaluation metrics. 


3.1 Datasets 


We evaluate our baseline on all the publicly available data sets used for evaluation
in Baur et al. [4] and Pinaya et al. [20].

To evaluate brain tumor detection, we use the training set of the 2020 ver- 
sion of the Multimodal Brain Tumor Image Segmentation Benchmark (BraTS) 
[2,3,18]. It contains T1, T2, and FLAIR scans of 371 subjects acquired across 19
institutions with multimodal, 3 T MRI scanners. It also contains manual segmen- 
tations of the tumor regions by up to four raters. The BraTS images are already 
skull stripped. The MSLUB [16] data set consists of T1, T2, and FLAIR images 
of 30 subjects with multiple sclerosis (MS). They have been acquired at the Uni- 
versity Medical Center Ljubljana (UMCL) with a 3T Siemens Magnetom Trio 
MR system. The consensus of three experts on white matter lesion segmentation 
is also included. As in [20], we evaluate on the White Matter Hyperintensities 
Segmentation Challenge (WMH) [15]. For this data set, T1 and FLAIR scans of 
60 patients were acquired at three different sites in the Netherlands and Singa- 
pore. The sites used 3T MRI scanners from Philips, Siemens, and GE. Manual 
segmentation of the lesions was conducted by an expert radiologist. Lastly, we 
use the training data of the 2015 Longitudinal MS Lesion Segmentation Chal- 
lenge [10]. This dataset has 21 T1, T2, PD, and FLAIR weighted MRI scans from 
5 subjects recorded at the Johns Hopkins MS Center with a 3 T Philips scanner.
Manual lesion segmentations are available from two raters. We use the ratings
of rater one (as indicated by the filename “mask1.nii”) for our evaluation.

Tumors are usually much larger anomalies than MS lesions. We evaluated the
exact distribution of anomaly sizes by performing a 3D connected component 
analysis on the segmentation maps of all data sets (Table 1). MSLUB has the 
smallest anomalies and also the largest number of anomalies per scan. 


Table 1. Results of the 3D connected component analysis of the segmentation maps
of all data sets after being registered to SRI space [21] and binarized with threshold
0.9 (see Sect. 3.2).


                           | BraTS | MSLUB | WMH | MSSEG2015
Avg. anomalies per scan    | 5     | 107   | 65  | 35
Avg. anomaly size (voxels) | 18027 | 106   | 194 | 224


3.2 Pre-processing 


Our pre-processing pipeline closely follows Baur et al. [4]. First, we skull strip 
the FLAIR scans using ROBEX [14]. Subsequently, we register them to the SRI 
space [21]. Specifically, since [21] does not contain a FLAIR atlas, we register
the T1-weighted images of all data sets and apply the same transformation to
the FLAIR images and the ground truth segmentation masks. This is possible,
as T1 and FLAIR images and the segmentation files are co-registered in all
the data sets used. Performing registration before skull stripping resulted in 
failed registrations in early experiments. The registration step is not vital for 
our algorithm but was purely done to ensure comparability with other methods. 
Figure 2 shows samples of pre-processed images from all four data sets. 


Fig. 2. Pre-processed and histogram-equalized samples (top row) and their
corresponding ground truth segmentations (bottom row) from the four data sets.


During the registration process, aliasing effects occur in the initially binary
ground truth segmentation masks, which cause these masks to also have non-binary
voxel values between 0 and 1 after registration. When loading the data,
a decision needs to be made at which threshold a voxel in the segmentation map
belongs to the segmented region. We consulted an expert radiologist and visually 
found 0.4 to be an acceptable threshold, but finally decided to follow Baur et al. 
[4] in using 0.9, to ensure better comparability. Note that altering this threshold 
has large effects on the performance of the evaluated models, especially on data 
sets with many small anomalies like lesions. A low threshold favors models that 
overestimate the true size of the anomalies, while a high threshold does the 
opposite. 
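
The effect of the binarization threshold can be illustrated with a small toy example. This is only a sketch under simplifying assumptions: a 1D mask stands in for a 3D segmentation, and resampling on a grid shifted by half a voxel with linear interpolation stands in for the registration step.

```python
import numpy as np

# Binary 1D "mask" with a lesion of three voxels.
mask = np.zeros(12)
mask[4:7] = 1.0

# Linear interpolation on a grid shifted by half a voxel produces
# fractional values at the lesion border.
resampled = 0.5 * (mask[:-1] + mask[1:])

def binarize(soft_mask, threshold):
    """Voxels strictly above the threshold belong to the segmented region."""
    return soft_mask > threshold

size_low = int(binarize(resampled, 0.4).sum())   # border voxels are included
size_high = int(binarize(resampled, 0.9).sum())  # border voxels are excluded
```

In this toy case the true lesion covers three voxels, while the low threshold yields four and the high threshold yields two, which is why small lesions are affected most.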


3.3 Method 


While other SAS methods usually compute anomaly maps using neural networks,
we propose to only perform histogram equalization on the pre-processed
input images and use the results directly as anomaly maps, since lesions and
tumors are often hyperintense in FLAIR images anyway. Histogram equalization
is necessary to compensate for contrast variations among different scanner types
and makes it possible to define a global (or at least dataset-wise) threshold for
binarization of the anomaly maps. We used the equalize_hist function of
scikit-image [24] with the default value of 256 bins and a binary mask considering
only pixels belonging to the brain and excluding the background. Using FLAIR
images is a fair comparison since Baur et al. [4] and Pinaya et al. [20] also
trained and evaluated on FLAIR images only. Our method does not require any
training data or learning procedure and scales trivially to arbitrary resolutions.
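
The following sketch illustrates masked histogram equalization with NumPy only. It is a stand-in written for illustration, not the scikit-image implementation used in our experiments; the function name `equalize_hist_masked`, the synthetic volume, and the threshold value are assumptions.

```python
import numpy as np

def equalize_hist_masked(image, mask, nbins=256):
    """Histogram-equalize `image` using only the voxels where `mask` is True."""
    values = image[mask]
    hist, bin_edges = np.histogram(values, bins=nbins)
    cdf = hist.cumsum().astype(np.float64)
    cdf /= cdf[-1]  # normalize the cumulative distribution to [0, 1]
    bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    out = np.zeros(image.shape, dtype=np.float64)  # background stays 0
    out[mask] = np.interp(values, bin_centers, cdf)
    return out

# Synthetic stand-in for a skull-stripped FLAIR volume and its brain mask.
rng = np.random.default_rng(0)
flair = rng.gamma(2.0, 50.0, size=(8, 16, 16))
brain = np.zeros(flair.shape, dtype=bool)
brain[:, 2:14, 2:14] = True

anomaly_map = equalize_hist_masked(flair, brain)
segmentation = anomaly_map > 0.95  # hypothetical dataset-wise threshold
```

Because the equalized intensities approximate the rank of each voxel within the brain, a fixed threshold selects roughly the same fraction of brightest voxels regardless of the scanner's contrast.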


3.4 Post-processing 


As our only post-processing step, we perform a connected component analysis
per scan on the 3D voxels as in [4] and discard all anomalies with fewer than 20
voxels. This value was found empirically and causes our algorithm to potentially
miss very small anomalies. However, it greatly reduces the noise in the anomaly
maps and thereby enhances their readability.
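
A minimal, dependency-free sketch of this post-processing step is given below. It assumes 6-connectivity, which the text above does not specify, and uses a plain breadth-first search; in practice a library routine such as `scipy.ndimage.label` would typically be used instead. The function name is hypothetical.

```python
from collections import deque
import numpy as np

def remove_small_components(binary, min_size=20):
    """Drop 6-connected 3D components with fewer than `min_size` voxels."""
    binary = np.asarray(binary, dtype=bool)
    keep = np.zeros_like(binary)
    visited = np.zeros_like(binary)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for seed in zip(*np.nonzero(binary)):
        if visited[seed]:
            continue
        visited[seed] = True
        queue, component = deque([seed]), [seed]
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in offsets:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= n[i] < binary.shape[i] for i in range(3))
                if inside and binary[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
                    component.append(n)
        if len(component) >= min_size:
            # Write the whole component back into the output mask.
            keep[tuple(np.array(component).T)] = True
    return keep
```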


3.5 Metrics 


We quantitatively assess the anomaly segmentation performance of our method
using a variety of metrics that are also frequently found in related works. All
metrics are computed dataset-wise. First, we compute Precision-Recall curves and
report the area under them (AUPRC). We also provide an upper limit for the Dice
similarity coefficient (⌈DSC⌉), computed using a search over n = 100 thresholds.
Lastly, we also provide the area under the receiver operating characteristic curve
(AUROC).
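
The three metrics can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not our evaluation code: the threshold grid is assumed to be equally spaced between the minimum and maximum score, AUPRC is approximated by average precision without special handling of tied scores, and all function names are hypothetical.

```python
import numpy as np

def best_dice(scores, labels, n_thresholds=100):
    """Best Dice score over a grid of thresholds (an upper bound on the DSC)."""
    best = 0.0
    for t in np.linspace(scores.min(), scores.max(), n_thresholds):
        pred = scores >= t
        denom = pred.sum() + labels.sum()
        if denom > 0:
            dice = 2.0 * np.logical_and(pred, labels).sum() / denom
            best = max(best, float(dice))
    return best

def average_precision(scores, labels):
    """Step-wise approximation of the area under the Precision-Recall curve."""
    order = np.argsort(-scores, kind="stable")
    hits = labels[order]
    precision = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return float(precision[hits].sum() / hits.sum())

def auroc(scores, labels):
    """AUROC via the rank-sum statistic, using average ranks for ties."""
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    cum = np.cumsum(counts)
    ranks = (cum - (counts - 1) / 2.0)[inverse]
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    return float((ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg))

def evaluate(anomaly_maps, ground_truth):
    """Dataset-wise metrics: all voxels of all scans are pooled before scoring."""
    scores = np.asarray(anomaly_maps, dtype=np.float64).ravel()
    labels = np.asarray(ground_truth, dtype=bool).ravel()
    return {
        "dice_upper_bound": best_dice(scores, labels),
        "auprc": average_precision(scores, labels),
        "auroc": auroc(scores, labels),
    }
```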

Table 2. Comparison of our proposed baseline to selected models of Baur et al. [4]
and [7]. We used slices 15 to 125 of the registered FLAIR images and a resolution of 
128 x 128. 


                      | MSLUB                   | MSSEG2015
Method                | ⌈DSC⌉ | AUPRC | AUROC | ⌈DSC⌉ | AUPRC | AUROC
AE (dense) [4]        | 0.271 | 0.163 | 0.794 | 0.185 | 0.080 | 0.879
AE (spatial) [4,5]    | 0.154 | 0.065 | 0.732 | 0.106 | 0.037 | 0.781
VAE (rest.) [4,12]    | 0.333 | 0.275 | 0.839 | 0.272 | 0.202 | 0.905
GMVAE (rest.) [4,12]  | 0.332 | 0.271 | 0.836 | 0.280 | 0.199 | 0.909
f-AnoGAN [4,22]       | 0.283 | 0.221 | 0.856 | 0.342 | 0.255 | 0.923
SSAE (spatial) [7]    | 0.301 | 0.222 | -     | -     | -     | -
Ours                  | 0.374 | 0.271 | 0.991 | 0.431 | 0.262 | 0.996


4 Results 


We evaluate our method in two experiments. First, we report the performance
when using slices 15 to 125 at a resolution of 128 x 128 as in the experiments
of Baur et al. [4] and [7]. These slices contain most of the brain region in the
SRI space [21], and tests did not show significant differences in the quantitative
evaluation compared to the full volumes. The results of experiment one are shown
in Table 2. Although the code for [4] is available online, we did not re-train the
models but used the values reported in the respective papers because the training
data used is not publicly available. We only report the numbers of a subset of
the best performing models; the others can be found in the original paper. In
our experiments, our proposed baseline outperforms all other methods in terms
of DSC and AUROC and is competitive in AUPRC. While all models in [4] use
a unified architecture, the detailed architecture of [7] is unknown, and the two
papers report significantly different performances for the same models on the
same data sets, indicating volatility of these methods (Table 3).


Table 3. Comparison of our proposed baseline to Pinaya et al. [20]. We used slices 84, 
85, 86, and 87 of the registered FLAIR images and a resolution of 224 x 224. 


                 | ⌈DSC⌉
Method           | BraTS | MSLUB | WMH
Transformer [20] | 0.759 | 0.465 | 0.441
Ours             | 0.738 | 0.613 | 0.557


In our second experiment, we compare to Pinaya et al. [20] at a resolution
of 224 x 224. In this experiment, there are some differences regarding pre- and
post-processing. Pinaya et al. [20] evaluate on data that was not skull stripped,
except for BraTS. They also did not perform any post-processing on the BraTS
data set. They registered to MNI space, which has 189 slices, instead of SRI space
with 155 slices. We therefore used slices 84, 85, 86, and 87 instead of 89 to 92 to
still ensure a fair comparison. Lastly, they used the older 2017 version of the BraTS
data set, whereas we used the latest 2020 version. Our baseline clearly outperforms
the Transformer on the MSLUB and WMH data sets and performs only
slightly worse on the BraTS data set.

Figure 3 shows the qualitative results of our proposed baseline. The visual
segmentation quality based on image-hyperintensities is decent and shows the 
approximate localization of anomalies. 




Fig. 3. Qualitative results of our baseline. Two samples are shown for each data set. 
Top row: input image. Middle row: predicted anomaly map, binarized using the thresh- 
old that yields the best DSC for each data set. Bottom row: ground truth anomaly 
segmentation. 


We also present the quantitative results of the two experiment settings for all 
data sets using all metrics in Table 4. In experiment one, our proposed method 
performs best on the BraTS data set which has the largest anomalies, and worst 
on MSLUB with the smallest anomalies. This can partly be attributed to our 
post-processing, where we discard connected components with fewer than 20 voxels.
Data sets with smaller anomalies are more affected by this. In experiment
two, BraTS is also the data set with the highest ⌈DSC⌉ and AUPRC.


5 Discussion 


The results in Sect. 4 show that a simple baseline can outperform or compete
with even the strongest related Machine Learning (ML) techniques. These findings
challenge the effectiveness of current ML approaches for SAS. The results
of Baur et al. [4] also show that DSC does not correlate well with reconstruction
quality. In particular, one can see in Fig. 4 that the best performing models
(VAE with restoration, dense GMVAE with restoration, and f-AnoGAN) produce
very textureless reconstructions. They can detect the largest connected
anomaly located at the dorsal aspect of the right lateral ventricle (note that the
images are oriented such that the patients’ right ventricle is on the left side of
the image) only because it is hyperintense in the input image. We refer to the
original paper for a higher-resolution version of this figure. Hence, we hypothesize
that the models in Baur et al. [4] do not perform anomaly segmentation by
learning the normal anatomy of the data, but that the necessary information to
perform anomaly segmentation with the performance presented in our work is
already present in the input image. The quantitative evaluation of our experiments
indicates that using the residual between the model output and the input
image actually degrades the segmentation quality of the resulting anomaly map.


Table 4. Full results of our proposed baseline on the two experimental settings.
Experiment I: using slices 15 to 125 of the registered FLAIR images and a resolution of
128 x 128. Experiment II: using slices 84, 85, 86, and 87 of the registered FLAIR
images and a resolution of 224 x 224.


              | ⌈DSC⌉ | AUPRC | AUROC
Experiment I
BraTS         | 0.666 | 0.671 | 0.988
MSLUB         | 0.374 | 0.278 | 0.991
WMH           | 0.457 | 0.339 | 0.979
MSSEG2015     | 0.431 | 0.262 | 0.996
Experiment II
BraTS         | 0.738 | 0.762 | 0.985
MSLUB         | 0.613 | 0.571 | 0.993
WMH           | 0.557 | 0.504 | 0.984
MSSEG2015     | 0.593 | 0.536 | 0.996

While we are aware that our baseline can only detect anomalies that are 
hyperintense, we argue that other techniques — especially those using resid- 
ual maps between the input image and a reconstructed or restored version as 
anomaly maps — are not assumption-free, but also impose strong biases on the 
types of anomalies they can detect. For example, Alzheimer’s disease, where 
one of the symptoms is atrophy of regions of the brain, cannot reliably be 
detected using pixel-wise residuals. Some of the existing works [4,20] leverage the 
same prior knowledge by considering only positive residuals as anomalies for MS 
lesions. However, our approach appears to make better use of this knowledge. 

We point out that there exist anomaly segmentation methods like [23] that
have been shown to be able to detect anomalies that are not necessarily hyperintense.
These methods, however, do not base their anomaly score on the reconstruction
error but have other inductive biases. Van Hespen et al. [23] limit the receptive
field of their model with the patch size used.

Fig. 4. Reconstructions (top row) and residuals (bottom row) of different ML-based
SAS techniques. The best performing models are highlighted in red. Image from and 
best viewed in [4]. (Color figure online) 


6 Conclusion 


In this work, we advanced the current state of the art in SAS of brain MRI by
introducing a simple method that requires no learning. Our findings challenge
the effectiveness of existing ML-based SAS approaches. While our work outperforms
competing methods, the results still lag behind those of expert
radiologists and the supervised methods presented in [10,15] and [3]. This provides
evidence for the need to explore alternative methods that overcome current
limitations. These could include new scoring functions or multi-modal approaches.
We also encourage the use of prior knowledge to build these models. While this
seems counter-intuitive at first, given the promise of SAS being able to detect
any kind of anomaly, we argue that current methods are also severely limited
by their scoring functions in the types of anomalies they are theoretically
able to detect. In this regard, we will explore the use of artificial anomalies in
anomaly segmentation. We hypothesize that through careful creation and selection
of artificial anomalies, models can generalize to real anomalies. Our work
also highlights the need for a benchmark data set to better compare different
techniques against each other. This benchmark should contain relevant
real-world anomalies of brain MRI, but should not be adequately solvable with
non-ML methods. Another disadvantage of the presented models is their limited
spatial scope. Current SAS methods process the 2D slices of a 3D volume individually.
We suspect that making better use of the 3D information of MRI will
improve the anomaly detection performance of the models. We plan to explore
the use of 3D machine learning models in future work, as they can fully incorporate
3D information, while humans can only process volumes such as MRI
slice-wise.


Abstract. We present a method to segment MRI scans of the human
brain into ischemic stroke lesion and normal tissues. We propose a neural 
network architecture in the form of a standard encoder-decoder where 
predictions are guided by a spatial expansion embedding network. Our 
embedding network learns features that can resolve detailed structures in 
the brain without the need for high-resolution training images, which are 
often unavailable and expensive to acquire. Alternatively, the encoder- 
decoder learns global structures by means of striding and max pooling. 
Our embedding network complements the encoder-decoder architecture 
by guiding the decoder with fine-grained details lost to spatial down- 
sampling during the encoder stage. Unlike previous works, our decoder 
outputs at 2x the input resolution, where a single pixel in the input 
resolution is predicted by four neighboring subpixels in our output. To 
obtain the output at the original scale, we propose a learnable down- 
sampler (as opposed to hand-crafted ones e.g. bilinear) that combines 
subpixel predictions. Our approach improves the baseline architecture by 
11.7% and achieves the state of the art on the ATLAS public bench- 
mark dataset with a smaller memory footprint and faster runtime than 
the best competing method. Our source code has been made available 
at: https://github.com/alexklwong/subpixel-embedding-segmentation. 


1 Introduction 


A stroke occurs when a lack of blood flow prevents brain tissue from receiving 
adequate oxygen and nutrients. This condition affects over 795,000 people annu- 
ally [28]. The severity of the outcome, including disability and paralysis, depends 
on the location and intensity of the stroke, as well as the time of diagnosis [2,30]. 
Preserving cognitive and motor functions, therefore, hinges on localizing stroke
lesions quickly and precisely. However, doing so manually requires expert knowledge,
is time-consuming, and is ultimately subjective [11,13].

We focus on automatically segmenting ischemic stroke lesions, which account
for 87% of all strokes [28], from T1-weighted anatomical magnetic resonance
imaging (MRI) brain scans. These lesions are characterized by high variability
in location, shape, and size. The latter two are problematic for conventional
convolutional neural networks (CNNs), where precision of irregularly shaped lesion
boundaries and recall of small lesions are critical measures of success. Due to
the aggressive spatial downsampling (i.e., max pooling, strided convolutions)
customary in CNNs, details of local structures are lost in the process. Yet, spatial
downsampling is necessary for obtaining a global representation of the input
while using fixed-size filters with limited receptive fields. The outcome is
segmentations with ambiguous boundaries between lesion and normal tissues
and missed lesions that occupy a small number of voxels in the MRIs.

We propose to retain small local structures by learning an embedding that 
maps the input to high dimensional feature maps of twice the input resolution. 
Unlike the typical CNN, we do not perform lossy downsampling on this rep- 
resentation; hence, the embedding preserves local structures, but lacks global 
context. When combined with a standard encoder-decoder, e.g., U-Net [19],
the embedding complements the encoder-decoder by supplying the decoder with 
fine-grained detail information to guide segmentation. Our network also outputs 
at twice the resolution of the input, representing each element in the input with 
a 2 x 2 neighborhood of predictions. The final output is obtained by combining 
the four predictions (akin to an ensemble) as a weighted sum where the contribu- 
tion of each prediction is learned from the data. Our design not only enables the 
network to produce robust segmentations but also to localize small lesions (Fig. 3).

Our contributions include (i) an embedding function that preserves fine- 
grained details of the input by mapping it to larger spatial dimensions, (ii) a 
neural network architecture that leverages the complementary strengths of the 
proposed embedding and an encoder-decoder to produce predictions at twice the 
input resolution, and (iii) a learnable downsampler that combines local predictions
in an ensemble fashion to yield robust segmentations at the input resolution.
Our approach improves the baseline U-Net architecture by approximately 11.7% and
achieves the state of the art on the ATLAS [11,12] dataset with lower computa- 
tional burden than the best competing method. 


2 Related Work 


Lesion Segmentation. Early works [4] aggregated classification results for 
the center pixel of patches sampled from an image. However, [4] lacked global 
context, so [21] addressed this with multi-stage cascaded hierarchical models. 
More recent works build upon the U-Net [19], a 2D fully-convolutional net- 
work with skip connections and up-convolutions. For example, [14] used a Dual 
Path Network [3] encoder while [26] leveraged dilated convolutions to inexpensively
increase receptive fields. Furthermore, [1] fused the U-Net with other
high-performing modules, the BConvLSTM [24] and the SENet [8], and [18]
introduced X-blocks to the U-Net, leveraging depthwise separable convolutions 
to reduce computational load. [31] used skip connections between successive 
encoder resolutions to prevent the loss of features and ConvLSTM [23] modules 
to maintain localization. 

Recent works also leveraged 3D architectural backbones to improve localiza- 
tion. [32] performed 3D convolutions on a subsection of the scan and fused the 
results with 2D convolutions. [9] proposed an attention gate to combine 2D segmentations
along the axial, sagittal, and coronal planes into a 3D volume. However,
these works use significantly larger memory footprints and 3D convolutions are 
computationally expensive — limiting the models’ practicality. We note that while 
conventional architectures perform well globally (i.e. recovering the coarse shape 
of lesions) they struggle to segment small lesions that blend into the background. 


Super-Resolution. There is an abundance of works in natural image super-resolution
[5,6,22,25,29] and a growing number in medical imaging. [20] proposed
to map MRI images from low to high resolution with an overcomplete
dictionary. [16] leveraged SRCNN [5] for super-resolving 2D MRI images and 
fused them to obtain a 3D volume. [17] handled arbitrary scaling factors with a 
3D architecture for multi-modal 3D data. However, these works require low and 
high-resolution image pairs for training and are limited to the super-resolution 
task while our method does not rely on a larger resolution ground truth. More 
recently, [27] introduced Kite-Net, an upsampling encoder that outputs a latent 
at 8x resolution followed by a max-pooling decoder to downsample back to 
the original resolution. Kite-Net is used in parallel with a U-Net for lesion seg- 
mentation. Our approach draws inspiration from super resolution and latent 
over-representations as methods to retain local structures that are often lost in
spatial downsampling. However, unlike [27], we avoid downsampling the latent 
with pooling (which discards information), and instead employ lossless space-to- 
depth and depth-to-space [22] operations to retain fine-grained details. Further- 
more, we propose to learn a subpixel embedding at 2x the original resolution 
to guide our segmentation, which uses a much smaller memory footprint than 
[27]. We show that our approach can capture small lesions that are missed by 
[18, 19, 27, 31, 32]. 


3 Method 


Fig. 1. Network architecture. SPiN is comprised of (i) a U-Net based encoder-decoder
that produces subpixel predictions $f^{\circ}(x)$ at 2x the input resolution, which are guided
by (ii) a subpixel embedding that captures local structure. The final output $f_\omega(x)$ is
achieved by combining local predictions in a 2 x 2 neighborhood as a weighted sum
based on the per element contribution predicted by a (iii) learnable downsampler.
Legend: ResNet Block, Convolutions, Depth-to-Space, Space-to-Depth, Decoder Blocks,
Latent Vector, Learnable Downsampler.


We propose a method to partition a 3D MRI volume $X \in \mathbb{R}^{C \times H \times W}$ into lesion
(positive, 1) and normal (negative, 0) classes. Our method takes, as input, a
3D slice of $c$ consecutive 2D images $x \in \mathbb{R}^{c \times H \times W}$ ($c$ is an odd integer) from $X$
and predicts the binary segmentation for the image $\bar{x} \in \mathbb{R}^{1 \times H \times W}$, the $\lceil c/2 \rceil$-th
image of $x$. In other words, $x$ is a sliding window of $c$ images centered at a
target image $\bar{x}$. To avoid sampling out of bounds, we perform mean padding of
size $\lfloor c/2 \rfloor \times H \times W$ on both sides of $X$ before sampling $x$ (see Sec. 1 of Supp.
Mat. for more details). To segment a single image $\bar{x}$, we propose to learn a deep
neural network $f_\omega$, parameterized by $\omega$, where $f_\omega : \mathbb{R}^{c \times H \times W} \to [0,1]^{1 \times H \times W}$
is a function that takes the 3D slice $x$ as an input and outputs the sigmoid
response $f_\omega(x)$, a confidence map corresponding to lesions in $\bar{x}$. To obtain the
binary segmentation of $X$, we aggregate our predictions by running $f_\omega$ for all
$x$ and setting any response greater than a threshold of 0.5 to the lesion class.
We note that our method can be extended to multi-class segmentation simply
by expanding our output to $[0,1]^{K \times H \times W}$ for $K$ classes, and choosing the class
with the highest response, i.e. $\arg\max f_\omega(\cdot)$, to yield the segmentation.
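
The sliding-window sampling and the aggregation into a binary volume can be sketched as follows. This is an illustration under assumptions: padding uses the global mean of the volume (the exact padding value is described in the supplementary material, not here), and `model` is a hypothetical callable that maps a window of shape (c, H, W) to a confidence map of shape (1, H, W).

```python
import numpy as np

def sample_windows(volume, c=5):
    """Yield (index, window) pairs; each window has shape (c, H, W) and is
    centered on the slice with the given index."""
    assert c % 2 == 1, "c must be odd so that the window has a center slice"
    half = c // 2
    pad = np.full((half,) + volume.shape[1:], volume.mean(), dtype=volume.dtype)
    padded = np.concatenate([pad, volume, pad], axis=0)
    for i in range(volume.shape[0]):
        yield i, padded[i:i + c]

def segment_volume(volume, model, c=5, threshold=0.5):
    """Run the model on every window and threshold the stacked responses."""
    responses = [model(window) for _, window in sample_windows(volume, c)]
    return np.concatenate(responses, axis=0) > threshold
```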


3.1 Network Architecture 


Our network $f_\omega$ (Fig. 1) is composed of two modules: (i) an encoder-decoder
(based on U-Net [19]) that outputs at 2x the input resolution, e.g. 2H x 2W,
whose predictions are guided by (ii) a network that maps the input $x$ to a high
dimensional embedding space also at twice the input resolution. The result is a
confidence map comprised of “subpixel” predictions: the output class for each
input pixel is represented by four predictions within a 2 x 2 neighborhood. Rather
than using hand-crafted downsampling techniques (e.g. bilinear, nearest neighbor)
to obtain the output at the original (1x) spatial resolution, we propose a
learnable downsampler that predicts the weight, or contribution, of each subpixel
prediction in a local region corresponding to the pixel in the 1x resolution.
For simplicity, we refer to our embedding function as a subpixel embedding and
our overall architecture ($f_\omega$) as a subpixel network or “SPiN” for short (Fig. 1).

Fig. 2. Learnable Downsampler, Space-to-Depth and Depth-to-Space. (a) Learnable
Downsampler predicts the contribution $h(z)$ of each subpixel prediction in $f^{\circ}(x)$ by
conditioning on $f^{\circ}(x)$ and the latent vector $g(x)$. Subpixel predictions $f^{\circ}(x)$ are
rearranged to the resolution of the input using Space-to-Depth. The final output $f_\omega(x)$
is produced by taking the element-wise dot product between $h(z)$ and the reshaped
$f^{\circ}(x)$. (b) Space-to-Depth reduces resolution by rearranging elements from the spatial
dimensions into the channel dimensions, where each 2 x 2 neighborhood is reshaped to a
4 element vector. Depth-to-Space conversely performs spatial expansion by rearranging
elements from the channel dimensions to height and width dimensions.


Subpixel embedding consists of feature extraction and spatial expansion 
phases. Feature extraction is performed by two ResNet blocks [7] with 16 filters 
per layer; we also use stride of 1 and zero-padded edges to minimize spatial 
reduction. The extracted 16 x H x W feature maps are fed to a depth-to-space 
module [22] that rearranges elements from the channel dimension to the height 
and width dimensions (see Fig. 2-(b)). The resulting set of 4 x 2H x 2W feature 
maps with twice the spatial resolution then undergoes a 1 x 1 and a 3 x 3 convolution
layer, with 8 filters each. The resulting 8 x 2H x 2W high dimensional
feature maps, produced by our subpixel embedding function, resolve fine local 
details by increasing the feature map resolution and thus representing informa- 
tion at each pixel location with four “subpixel” feature vectors. 
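
The depth-to-space rearrangement described above, together with its inverse (space-to-depth), can be sketched in NumPy as follows. The channel ordering follows the common pixel-shuffle convention, which is an assumption here; the text only fixes the input and output shapes.

```python
import numpy as np

def depth_to_space(x, r=2):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r) without losing values."""
    c, h, w = x.shape
    assert c % (r * r) == 0
    x = x.reshape(c // (r * r), r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)  # (C, H, r, W, r)
    return x.reshape(c // (r * r), h * r, w * r)

def space_to_depth(x, r=2):
    """Inverse operation: rearrange (C, H*r, W*r) into (C*r*r, H, W)."""
    c, h, w = x.shape
    assert h % r == 0 and w % r == 0
    x = x.reshape(c, h // r, r, w // r, r)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, r, r, H, W)
    return x.reshape(c * r * r, h // r, w // r)

# 16 x H x W features become 4 x 2H x 2W, as in the subpixel embedding.
features = np.arange(16 * 3 * 5, dtype=np.float32).reshape(16, 3, 5)
expanded = depth_to_space(features)
```

Both functions only permute elements, so applying one after the other recovers the input exactly; this is the sense in which the operations preserve information, in contrast to pooling.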

When used as skip connections, these embeddings complement the standard
U-Net architecture that obtains a global representation of the input by spatial
downsampling (striding and max pooling), which naturally discards local detail.
Hence, we propose to inject these embeddings into the decoder via feature concatenation
at the original (1x) resolution and at the 2x output resolution. To
reduce the height and width dimensions of the embeddings to match the feature
maps at the 1x resolution, we propose a space-to-depth module, which performs
the inverse operation of depth-to-space (see Fig. 2-(b)), yielding 32 x H x W
feature maps. Unlike striding and pooling, the space-to-depth operation is information
preserving as it rearranges feature vectors from the height and width
dimensions to the channel dimension. The result is fed through a 3 x 3 convolutional
layer with 8 filters and concatenated with the feature maps of the
decoder at the 1x resolution. Similarly, the embeddings at 2x resolution undergo
a separate 3 x 3 convolution to yield the output resolution guidance before being
concatenated with their corresponding feature maps in the decoder. Finally, the
2x decoder output $f^{\circ}(x) \in [0,1]^{1 \times 2H \times 2W}$ is produced by convolving a single
3 x 3 filter over the resulting latent vector $g(x) \in \mathbb{R}^{24 \times 2H \times 2W}$. We use subpixel
guidance (SPG) to refer to the process of learning and injecting the embedding as
skip connections, which substantially helps with localizing small lesions missed
by previous works [18,19,31,32] (see Fig. 3). We note that SPG is light-weight
and only uses 16K parameters.

Learnable downsampler takes the concatenation $z = [g(x); f^{\circ}(x)]$ of the
latent vector g(x) and the 2x resolution output $f^{\circ}(x)$ and predicts h(z), where
$h : \mathbb{R}^{25 \times 2H \times 2W} \to [0,1]^{4 \times H \times W}$. In other words, h(z) is a set of 4 x H x W
values that determine the contribution of each subpixel prediction in a 2 x 2
neighborhood of $f^{\circ}(x)$. To achieve this, we first perform space-to-depth on z to
rearrange each 2 x 2 neighborhood into a 4 element vector. This is followed by 
two 3 x 3 convolutions of 16 filters and a 1 x 1 convolution with 4 filters. h(z) is 
the softmax response of the result along the channel dimension. 

To obtain the final output $f_\theta(x)$, we utilize space-to-depth to rearrange $f^{\circ}(x)$
into the shape of 4 x H x W (to match the shape of h(z)) and take its element-wise
dot product with h(z). With an abuse of notation, $f_\theta(x) = f^{\circ}(x) \cdot h(z)$. Because
h(z) is conditioned on the latent vector g(x) of the input, the predicted weights 
respect lesion boundaries to yield detailed segmentations. This is unlike bilinear 
or nearest-neighbor downsampling where weights are predetermined and inde- 
pendent of the input. We note that our learnable downsampler is also lightweight 
and only consists of 11K parameters. 
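
The aggregation step can be sketched as follows, assuming the convolutional layers have already produced four scores per output pixel. The function names and the `logits` argument are hypothetical stand-ins for the network output before the softmax; this is an illustration of the weighting, not the authors' implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a short list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_downsample(f2x, logits):
    """Aggregate a 2H x 2W prediction map into H x W.

    f2x:    2H x 2W nested list of subpixel predictions in [0, 1].
    logits: 4 x H x W nested list, a hypothetical stand-in for the
            convolutional output that the softmax turns into h(z).
    """
    H, W = len(logits[0]), len(logits[0][0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            weights = softmax([logits[k][i][j] for k in range(4)])
            subpixels = [f2x[2 * i + dy][2 * j + dx]
                         for dy in (0, 1) for dx in (0, 1)]
            out[i][j] = sum(w * s for w, s in zip(weights, subpixels))
    return out


f2x = [[1.0, 0.0],
       [0.0, 0.0]]
# Equal scores reduce to plain 2 x 2 averaging.
uniform = weighted_downsample(f2x, [[[0.0]], [[0.0]], [[0.0]], [[0.0]]])
# A dominant score makes the output follow a single subpixel.
peaked = weighted_downsample(f2x, [[[50.0]], [[0.0]], [[0.0]], [[0.0]]])
```

With equal scores the result matches fixed average pooling, while input-dependent scores let the output follow a single subpixel, which is the behavior that fixed bilinear or nearest-neighbor weights cannot provide.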


3.2 Loss Function 


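We assume a training set of $\{(x^{(n)}, \bar{y}^{(n)})\}_{n=1}^{N}$, where $\bar{y}^{(n)}$ is the ground truth
corresponding to $\bar{x}^{(n)}$, the image located at the center of $x^{(n)}$. To train SPiN,
we minimize the standard binary cross entropy loss,


$$\ell(y, \bar{y}) = \frac{1}{|\Omega|} \sum_{u \in \Omega} -\big( \bar{y}(u) \log y(u) + (1 - \bar{y}(u)) \log(1 - y(u)) \big), \qquad (1)$$


where $\Omega \subset \mathbb{R}^2$ denotes the spatial image domain, $u$ a pixel coordinate, and
$y = f_\theta(x)$ the network output. The loss over the training set of N samples reads


$$L(\theta) = \frac{1}{N} \sum_{n=1}^{N} \ell(f_\theta(x^{(n)}), \bar{y}^{(n)}). \qquad (2)$$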
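
As a minimal sketch of the two losses above, the following pure-Python code evaluates the per-image binary cross entropy and its average over a set of samples. Averaging over pixels and clipping predictions with `eps` are assumptions made here for numerical convenience; the function names are illustrative.

```python
import math


def bce(y, y_bar, eps=1e-7):
    """Binary cross entropy between a prediction map y and ground truth y_bar.

    Both are nested lists (rows of pixels). Predictions are clipped to
    [eps, 1 - eps] so that log() stays finite (an assumption of this sketch).
    """
    total, count = 0.0, 0
    for row_pred, row_true in zip(y, y_bar):
        for p, t in zip(row_pred, row_true):
            p = min(max(p, eps), 1.0 - eps)
            total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            count += 1
    return total / count


def dataset_loss(predictions, targets):
    """Average of the per-image loss over all samples."""
    losses = [bce(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)


uncertain = bce([[0.5, 0.5]], [[1, 0]])  # equals log(2)
confident = bce([[1.0, 0.0]], [[1, 0]])  # close to zero
```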


We note that previous works [31,32] used soft Dice loss (an approximation of
the true Dice score) to counter the class imbalance between normal and lesion
tissues, characteristic of the lesion segmentation problem. However, a minimizer
of cross entropy equivalently minimizes Dice, and empirically, we found that
directly minimizing cross entropy yields better performance for our model. We
hypothesize that our SPG allows small lesions to be recovered more easily, making
our method more conducive to minimizing cross entropy, which is not prone
to the noisy training signal inherent in soft Dice. We demonstrate this in row 7 of
Table 4 in our ablation studies. Also, we note that our loss can be easily extended
for multi-class classification to accommodate multiple lesion categories.


Table 1. Evaluation metrics. IOU denotes Intersection Over Union, and DSC denotes
Dice similarity coefficient. TP, FN, and FP correspond to true positive, false negative,
and false positive, respectively.


Metric     | IOU             | DSC                     | Precision    | Recall
-----------|-----------------|-------------------------|--------------|-------------
Definition | TP / (TP+FN+FP) | 2 x TP / (2 x TP+FN+FP) | TP / (TP+FP) | TP / (TP+FN)
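
The definitions in Table 1 translate directly into code. The sketch below counts TP, FN, and FP from two flattened binary masks and evaluates the four metrics; the helper names are illustrative, and the case of empty masks (zero denominators) is deliberately left unhandled.

```python
def confusion_counts(pred, target):
    """Count TP, FN, FP for two equally long lists of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    return tp, fn, fp


def segmentation_metrics(tp, fn, fp):
    """IOU, DSC, precision, and recall as defined in Table 1.

    Denominators are assumed to be non-zero in this sketch.
    """
    return {
        "IOU": tp / (tp + fn + fp),
        "DSC": 2 * tp / (2 * tp + fn + fp),
        "Precision": tp / (tp + fp),
        "Recall": tp / (tp + fn),
    }


pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
scores = segmentation_metrics(*confusion_counts(pred, target))
```

Note that DSC and IOU are monotonically related through DSC = 2 IOU / (1 + IOU), so for a single image they always rank predictions in the same order.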


4 Experiments and Results 


We demonstrate our method on the Anatomical Tracings of Lesion After Stroke 
(ATLAS) MRI dataset [11,12], using the metrics defined in Table 1. ATLAS con- 
tains 304 T1-weighted MRI scans of stroke patients with corresponding lesion 
annotations. The data is collected from 11 research sites worldwide, manually 
annotated, and post-processed (i.e. smoothing and defacing for privacy), leav- 
ing 239 patient scans with 189 2D images (197 x 233 resolution) each. Since no 
official data split is provided by [11], previous works [18,31,32] evaluated their 
methods using k-fold cross validation and randomly sampled data splits. How- 
ever, the value of k and samples within each split varied across works. Due to 
the lack of consistency, the reported results are not directly comparable. Thus, 
we propose a training (212 patients) and a held-out testing (27 patients) split 
to standardize the evaluation protocol for more rigorous comparisons. We pro- 
vide quantitative comparisons against [18,19,27,31,32] on the proposed training 
and testing split in Table 2. We also show qualitative (Fig. 3) and quantitative
(Table 3) comparisons on segmenting small lesions using a subset of the test set: 490
images containing only lesions smaller than 100 pixels (0.2% of the image). All
reported results for previous works are obtained using their training procedures 
and open-sourced code. We also provide details on our training and testing split 
in Sec. 2 of Supp. Mat. and further k-fold cross validation comparisons in Sec. 3 
of Supp. Mat. 


Fig. 3. Qualitative results on ATLAS. Columns 2-8 show (zoomed in) head-to-head
comparisons across all methods for highlighted areas in column 1. Row 1 demonstrates
that SPiN outperforms existing works in capturing shape and boundary details
in medium-sized, irregularly-shaped lesions. Furthermore, rows 2 and 3 demonstrate
SPiN’s ability to localize small lesions that are missed by other models.


Table 2. Quantitative comparison on ATLAS. SPiN outperforms all methods across
all performance metrics. It is also one of the least computationally expensive models,
i.e. smallest test time memory footprint, second in training memory usage, and third
fastest in runtime per patient (189 images).


Method        | DSC   | IOU   | Precision | Recall | Runtime (s) | Train memory (GB) | Test memory (GB)
--------------|-------|-------|-----------|--------|-------------|-------------------|-----------------
U-Net [19]    | 0.584 | 0.432 | 0.674     | 0.558  | 1.375       | 2.291             | 1.181
D-UNet [32]   | 0.548 | 0.404 | 0.652     | 0.521  | 3.425       | 15.426            | 15.426
CLCI-Net [31] | 0.599 | 0.469 | 0.741     | 0.536  | 8.860       | 7.853             | 7.853
KiU-Net [27]  | 0.524 | 0.387 | 0.703     | 0.459  | 1.05        | 23.566            | 1.555
X-Net [18]    | 0.639 | 0.495 | 0.746     | 0.588  | 5.046       | 11.839            | 11.839
SPiN (Ours)   | 0.703 | 0.556 | 0.806     | 0.654  | 2.145       | 3.273             | 0.803


Implementation Details. Our model is implemented in PyTorch [15] and
optimized using Adam [10]. We used an initial learning rate of $3 \times 10^{-4}$, decreased
it to $1 \times 10^{-4}$ after 400 epochs, and to $5 \times 10^{-5}$ after 1400 epochs for a total
of 1600 epochs. We choose c = 5 for the number of images in the input x. During
training, $\bar{x}$ and its corresponding x are randomly sampled from X. Training takes
8 h on an Nvidia GTX 1080 GPU, and inference takes ~11 ms per 2D image.
For data augmentation, we randomly perform (i) horizontal and vertical flips,
(ii) rotation between -30° and 30°, and (iii) add zero-mean Gaussian noise with
standard deviation of $1 \times 10^{-2}$ to training samples. We perform augmentation
with a probability of 1 for 1400 epochs and decrease it to 0.5 thereafter so that
training samples will be closer to the true distribution of the dataset.
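
The training schedule described in the implementation details can be written as two small step functions. This is a sketch under the assumption that epochs are counted from 0 and that each change takes effect at the start of the stated epoch; the function names are illustrative and not taken from the released code.

```python
def learning_rate(epoch):
    """Piecewise-constant learning rate over the 1600 training epochs."""
    if epoch < 400:
        return 3e-4
    if epoch < 1400:
        return 1e-4
    return 5e-5


def augmentation_probability(epoch):
    """Probability of applying augmentation to a training sample."""
    return 1.0 if epoch < 1400 else 0.5


# Values at the start, at both switch points, and at the last epoch.
schedule = [(e, learning_rate(e), augmentation_probability(e))
            for e in (0, 400, 1400, 1599)]
```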


ATLAS Test Set. Table 2 shows that our approach outperforms competing
methods [18,19,27,31,32] across all evaluation metrics. Specifically, we beat the
best performing method, X-Net [18], by an average of +10.4% with a 72.3% reduction
in training memory and a 57.5% runtime reduction during inference. Our
approach also has a smaller memory footprint, containing only ~5.3M parameters,
compared to 15M in [18]. Another key comparison is with KiU-Net,
which learns a representation at 8x the original input spatial resolution. Unlike
us, KiU-Net [27] uses max pooling layers, which discard information, to reduce
the size of its high resolution representation to the original (1x) resolution.
In contrast, we maintain the 2x resolution of our embedding until the output layer,
which yields subpixel predictions that are aggregated by our learnable downsampler
to the 1x resolution. Admittedly, this comes at the cost of runtime: our
method requires 2.145 s per patient and KiU-Net [27] requires 1.05 s. However,
we outperform [27] by an average of 33.7% across all metrics and reduce test
time memory by half. We show qualitative comparisons in row 1 of Fig. 3, where
the segmentation produced by our approach better captures irregularly shaped
lesions than those predicted by competing methods.


Table 3. Evaluation on small lesion subset. While [31] achieves the highest precision,
we note they have the second lowest recall out of all methods; missing small lesions can
negatively impact patient recovery. In contrast, our method ranks second in precision
and first across all other metrics.


Method        | DSC   | IOU   | Precision | Recall
--------------|-------|-------|-----------|-------
U-Net [19]    | 0.368 | 0.225 | 0.440     | 0.316
D-UNet [32]   | 0.265 | 0.180 | 0.377     | 0.264
CLCI-Net [31] | 0.246 | 0.178 | 0.662     | 0.215
KiU-Net [27]  | 0.246 | 0.255 | 0.466     | 0.206
X-Net [18]    | 0.306 | 0.213 | 0.546     | 0.268
SPiN (Ours)   | 0.424 | 0.269 | 0.546     | 0.347


Table 4. Ablation study on ATLAS. Removing SPG and/or LD results in performance
decrease (rows 1, 2, 6), and SPG cannot be substituted with more parameters or
interpolation (rows 3-5). The best results are achieved by our full model (row 8).


Row | Method                               | DSC   | IOU   | Precision | Recall
----|--------------------------------------|-------|-------|-----------|-------
1   | Without SPG, LD (Baseline)           | 0.634 | 0.487 | 0.707     | 0.606
2   | Without SPG                          | 0.637 | 0.487 | 0.701     | 0.613
3   | Replace SPG with addit. convolutions | 0.627 | 0.475 | 0.721     | 0.596
4   | Replace SPG w/bilinear upsampling    | 0.663 | 0.513 | 0.780     | 0.600
5   | Replace SPG w/nearest upsampling     | 0.660 | 0.513 | 0.762     | 0.626
6   | Replace LD with downsampling         | 0.670 | 0.526 | 0.786     | 0.625
7   | Full model with soft Dice loss       | 0.684 | 0.546 | 0.729     | 0.672
8   | Full model                           | 0.703 | 0.556 | 0.806     | 0.654


Small Lesion Segmentation. Here, we consider the task of segmenting lesions 
that occupy fewer than 100 pixels or 0.2% of the image. Due to the challenging 
nature of the task, we observe an expected drop in performance across all meth- 
ods (trained on the proposed split) when segmenting small lesions (Table 3), as 
compared to doing so for all lesion sizes (Table 2). However, we still outperform 
all competing methods — by even larger margins than on the full test set. This 
shows that competing methods, while able to localize large and medium sized 
lesions, actually perform poorly on small lesions. With the exception of precision,
where we tie for second with X-Net [18], we rank first in all other metrics.
We note that while CLCI-Net [31] has the highest precision, it also achieves the
second lowest recall, meaning that it misses many small lesions, whose detection
is critical to clinical prognosis and thus patient recovery. This is also reflected in
DSC and IOU, where we outperform [31] by 72% and 51%, respectively. Qualitatively,
rows 2 and 3 in Fig. 3 show that our method successfully localizes small lesions that
[18,19,27,31,32] miss entirely.
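
The relative improvements quoted in the text can be reproduced from the tabulated scores. The sketch below assumes that an "average" improvement is the unweighted mean of the per-metric relative gains, an interpretation that matches the reported figures; the helper name is illustrative.

```python
def mean_relative_gain(ours, theirs):
    """Mean relative gain (in percent) of `ours` over `theirs`, metric by metric."""
    gains = [(o - t) / t for o, t in zip(ours, theirs)]
    return 100.0 * sum(gains) / len(gains)


# DSC, IOU, precision, recall on the full test set (Table 2).
spin = [0.703, 0.556, 0.806, 0.654]
x_net = [0.639, 0.495, 0.746, 0.588]
kiu_net = [0.524, 0.387, 0.703, 0.459]

gain_over_x_net = mean_relative_gain(spin, x_net)      # about 10.4
gain_over_kiu_net = mean_relative_gain(spin, kiu_net)  # about 33.7

# DSC and IOU on the small lesion subset (Table 3), SPiN versus CLCI-Net.
dsc_gain = mean_relative_gain([0.424], [0.246])  # about 72
iou_gain = mean_relative_gain([0.269], [0.178])  # about 51
```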


Ablation Studies. Table 4 shows the effect of each of our contributions to
architectural design. Row 1 shows that our baseline, a U-Net [19] based
encoder-decoder, performs 11.7% worse than the proposed approach
because it lacks fine local details from SPG and uses bilinear downsampling
instead of a learnable downsampler (LD). Including LD alone, but not SPG
(row 2), provides no improvement as the network only learns a coarse global
representation and is still missing details lost during spatial downsampling.

In row 3, we show that solely increasing parameters (i.e. adding ResNet 
blocks [7] to the baseline) brings no improvement, which suggests that the per- 
formance boost is not a result of a larger network. In fact, SPG and the learnable 
downsampler marginally increase the model size as they only combine for 27K 
parameters. Rows 4 and 5 show that using hand-crafted 2x resolution images 
(from bilinear, nearest neighbor upsampling) does provide some gain. In these 
experiments, we replace SPG with different interpolation methods and the higher 
resolution images undergo 3 x 3 convolutions before being passed as skip connections
to the decoder. However, because the 2x representation is not learned, as
it is with SPG, the result is still ≈6% worse than our full model. Our learnable
downsampler (LD) contributes 4.4% to our performance (row 6) as removing LD 
and replacing it with bilinear interpolation smooths lesion boundaries, resulting 
in loss of details. Finally, we justify the use of cross entropy for our loss func- 
tion; row 7 demonstrates that minimizing a soft Dice loss, as in [31,32], results in 
worse performance. The best performance is achieved with our full model using 
SPG and LD, and minimizing cross entropy (row 8). 


5 Discussion 


We propose SPiN, a network architecture that learns a spatially increasing 
embedding that, when used as guidance for an encoder-decoder network, helps 
ensure that small structures are not lost through spatial downsampling in the 
encoder. We note that our embedding does not create extra spatial information 
(data processing inequality), but serves as a means for better characterization of 
local regions for the downstream segmentation task. While we outperform existing
works and improve on small lesion segmentation, our method requires more memory
and compute than the baseline. However, the extra cost is within reason (1 GB
of memory for training and ≈0.7 s in runtime) and does not limit applicability.
Despite the improved segmentation performance, we acknowledge that
there is still room for improvement, especially with small lesions. The highest
recall of 0.347, achieved by our model, is admittedly low compared to recall metrics
on the full dataset, implying that many small lesions still pass undetected.
We note that this is one of the first works to study subpixel architectures in
lesion segmentation, and we hope our promising results will motivate further
exploration in this direction.

Abstract. Automated brain tumor segmentation is challenging given
the tumor’s variability in size, shape, and image intensity. This paper 
focuses on the fusion of multimodal information coming from different 
Magnetic Resonance (MR) imaging sequences. We argue it is important 
to exploit all the modality complementarity to better segment and later 
determine the aggressiveness of tumors. However, simply concatenating 
the multimodal data as channels of a single image generates a high vol- 
ume of redundant information. Therefore, we propose a supervoxel-based 
approach that regroups pixels sharing perceptually similar information 
across the different modalities to produce a single coherent oversegmenta- 
tion. To further reduce redundant information while keeping meaningful 
borders, we include a variance constraint and a supervoxel merging step. 
Our experimental validation shows that the proposed merging strategy 
produces high-quality clustering results useful for brain tumor segmentation.
Indeed, our method reaches an ASA score of 0.712 compared
to 0.316 for the monomodal approach, indicating that the supervoxels
adhere well to tumor boundaries. Our approach also improves the
Global Score (GS) by 11.5%, showing that clusters effectively group pixels
similar in intensity and texture.


Keywords: Brain tumor · Supervoxel · Merging · Graph · Clustering


1 Introduction 


Identifying the edges of brain tumors and observing their evolution is critical to 
accurately assess disease progression and thus better guide the patient’s treat- 
ment plan [9]. 

There is a multiplicity of brain imaging techniques, starting from the different
Magnetic Resonance Imaging (MRI) sequences, providing complementary
information about brain tumors. However, multi-modality makes tumor segmentation,
i.e., delineating the tumor’s edges and quantifying the tumor’s size, more
complex. Commonly used sequences include T1, T2, FLAIR, and T1-weighted
contrast-enhanced (T1CE). The visibility of glioma differs across the various
sequences (modalities). In the T1CE image, regions of the brain are similar
to the tumor edema region. In the T1CE, the active and necrotic regions of a
tumor can be clearly distinguished. The intensities of edema and tumor regions
are higher in the T2 sequence images and the FLAIR images, whereas the intensities
of cerebrospinal fluid (CSF) are higher in the T2 and lower in the FLAIR
images. To sum up, one modality can present weak tumor edges but strong tumor
features, while another may have strong edges but weak features. Many of the
existing algorithms for brain tumor analysis focus on a single modality (e.g., a
specific MRI sequence), limiting the information available for
segmentation.

Conversely, multimodal information can make the delineation and quantification
more accurate, thanks to the modalities’ complementarity. However,
simultaneously processing different MRI sequences comprising millions of voxels
induces a significant increase in computational time. To tackle this problem, we
propose to oversegment the original sequences so as to process supervoxels
with similar information instead of individual voxels. The concept
of superpixel was originally introduced in [1] as a small homogeneous group
of neighboring pixels. Hereafter, we refer to a supervoxel as an extension of a
superpixel in the 3-D multi-modal setting.

We propose a two-stage unsupervised supervoxel-based approach. The first
stage performs an over-segmentation of the multimodal image with a supervoxel
approach that approximates the boundaries of tumors and other objects in the
multimodal image. The supervoxels are computed using an adaptation of the
Scalable Simple Linear Iterative Clustering (SSLIC) algorithm [13]. Our adaptation
adds a local regularity coefficient based on the variance [6] to the
SSLIC algorithm. The coefficient increases the spatial constraint for supervoxels
having high intensity variance, and reduces it in areas with lower variance.
Thereby, it allows supervoxel boundaries to capture perceptible objects with limited
intensity variations. The second stage fuses multimodal supervoxels with a
merging algorithm inspired by Fu et al. [5] to reduce the supervoxels’ redundancy
and their number prior to any classification task.

We evaluated our method on the publicly available multimodal BraTS 2020 
dataset, which is a standard brain tumor segmentation benchmark [16]. Exper- 
iments show that the proposed merging produces highly accurate clusters com- 
pared to traditional monomodal approaches, thanks to the complementarity 
between modalities. We also demonstrate that using the local regularity coef- 
ficient allows generating more regular clusters on textures, better guiding the 
merging procedure. In the resulting segmentation after merging, the redundancy
is reduced by a factor of 35 and the obtained supervoxels adhere very well to
tumor boundaries.

2 Related Work


Brain tumor and lesion segmentation is often formulated as a pixel-wise seman- 
tic segmentation problem addressed with supervised learning approaches [4]. 
Among them, Convolutional Neural Networks (CNNs) have emerged as the current
best-performing methods [15], taking different forms: 2D CNNs [2,18], 3D
CNNs [3], or extended to fully convolutional [12] or multimodal approaches [23].
Despite their good performance, pixel-wise methods suffer from high computa- 
tional complexity due to the significant number of redundant pixels, particularly 
when dealing with multimodal images. This complexity affects both classical and 
learning-based algorithms. In the case of CNNs, multimodal images may require 
higher capacity networks, prone to overfitting if the training dataset is small. In 
this work, we take a step aside from pixel-wise semantic segmentation and focus 
on the unsupervised early fusion of multimodal information. 

Compared to pixels, superpixels are more consistent with human visual cognition,
contain less redundancy, and reduce noise. Superpixel-based algorithms are
generally much faster than pixel-based ones because they analyze
clusters of pixels [7]. These properties are useful for computationally expensive
tasks, such as brain tumor segmentation in multi-sequence MRI images. Most
superpixel-based algorithms cluster the image into a high number of redundant
superpixels (called oversegmentation) by adding cuts to a graph or growing from
predefined seeds [24]. Superpixel methods combined with conventional machine
learning approaches have been used for brain tumor segmentation and have proven
fast and robust to noise, initialization, and intensity non-uniformity
[10,20]. However, these approaches neglect multimodal information in the superpixel
step. Ignoring multimodality leads to a lack of adherence to weak boundaries,
as noticed by Wang et al. [25]. Therefore, we opt for combining multimodal
acquisitions, taking advantage of the complementary information to detect more
detailed tumor structures and better adhere to borders.

Regarding other multimodal methods for brain tumor segmentation, Rahim- 
pour et al. [19] compare early and late CNN fusion, favoring late fusion as it 
does not need an initial registration step. In our work, we opt for an early but 
unsupervised fusion which assumes pre-registered modalities. Soltaninejad et al. 
[22] also proposed an early multimodal fusion approach to produce supervoxel 
boundaries across multiple MR sequences, enforcing adherence to weak structure
boundaries. However, similar to the monomodal case, the algorithm results
in a large number of redundant superpixels, which unnecessarily increases
computation time and can lead to a higher false-positive rate. For this reason, we
propose two contributions to reduce supervoxel redundancy in the multimodal
case: first, a variance constraint inspired by the work of Giraud et al. [6],
proposed in the context of natural images to better account for textured regions;
and second, a supervoxel merging step.

Outside the brain tumor segmentation literature, there has been interest 
in superpixel and supervoxel merging approaches. Luengo et al. [14] proposed a 
method that achieves high segmentation performance while reducing the number 
of redundant superpixels in the image, based on an iterative splitting and merging
algorithm. Focusing on the scale, Fu et al. [5] introduced a multiscale approach
for superpixel merging in the RGB color space. The method uses multiple
features to calculate a dissimilarity score between pairs of superpixels, including
color, texture, and common border length. Moreover, it simplifies the merging
graph to accelerate the merging procedure. For these two reasons, which are
relevant in our multimodal MRI case, we rely on Fu’s multiscale approach for
supervoxel merging. Our experimental validation shows qualitatively and
quantitatively the pertinence of our two contributions: the variance constraint and
the merging approach. Our approach, combining multimodal supervoxels, the
variance constraint, and the merging step, improves tumor boundary adherence
and significantly reduces supervoxel redundancy.


3 Methods 


Let multiple images of the same anatomy be acquired with different modalities
and then registered to form the multimodal image $I = [I_1, I_2, \ldots, I_M]$. $I$ is a 3-D
volume whose every voxel contains an M-dimensional vector. Our goal is to find a
single partition $S$ of non-overlapping supervoxels $S_i$, such that $S = \bigcup_{i=1}^{n} S_i$, taking
into account intensities and borders in all modalities. To this end, we propose a
two-step method. First, an initial oversegmentation is performed with the SSLIC
algorithm [13], refined with a variance constraint to better model the texture.
As a result, we obtain an initial supervoxel clustering (see Sect. 3.1). However,
the oversegmentation can lead to a substantial number of supervoxels even for
a small tumor. This creates a burden for later tasks, such as classification. To
reduce the final number of supervoxels, a second step is necessary. Inspired by
the work of Fu et al. [5], we construct a graph G over the oversegmentation and
merge similar vertices to obtain a more meaningful segmentation (see Sect. 3.2).
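
To make the graph construction concrete, the following sketch derives the adjacency edges of such a graph from a small label volume. It is an illustration under stated assumptions rather than the authors' code: supervoxels are taken to be adjacent when they share a voxel face (6-connectivity), and the volume is stored as nested lists indexed as [z][y][x].

```python
def region_adjacency_edges(labels):
    """Return the set of adjacent supervoxel pairs in a 3-D label volume.

    labels: nested lists indexed [z][y][x] holding integer supervoxel ids.
    Adjacency means sharing a voxel face (an assumption of this sketch).
    """
    depth, height, width = len(labels), len(labels[0]), len(labels[0][0])
    edges = set()
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                a = labels[z][y][x]
                # Checking only the "forward" neighbors visits each pair once.
                for dz, dy, dx in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    zz, yy, xx = z + dz, y + dy, x + dx
                    if zz < depth and yy < height and xx < width:
                        b = labels[zz][yy][xx]
                        if a != b:
                            edges.add((min(a, b), max(a, b)))
    return edges


# A single 2 x 3 slice containing three supervoxels.
labels = [[[0, 0, 1],
           [2, 2, 1]]]
edges = region_adjacency_edges(labels)
```

Each returned pair corresponds to one edge of the graph; the merging step then attaches a dissimilarity weight to every edge.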


3.1 Oversegmentation Based on Supervoxels 


Supervoxels are irregular image blocks composed of adjacent voxels with simi- 
lar texture, intensity, and brightness features. Currently, there are two common 
types of supervoxel segmentation algorithms. The first one is based on graph 
theory and the second on gradient ascent. The latter category includes the
well-known Simple Linear Iterative Clustering (SLIC) approach [1] and its ITK
version [8]. We rely on SSLIC with multimodal features [13] to obtain a first
oversegmentation of the image. By multimodal features we mean that each voxel is
characterized by an M-dimensional vector containing the intensities for that voxel
across all modalities. First, an initial clustering is given and then the clustering
is improved iteratively until convergence (refer to [13] for details). 

We propose an adaptation of the SSLIC algorithm (SSLICvar) that modulates
the supervoxel compactness according to the supervoxel’s feature variance.
This constraint was initially introduced by Giraud et al. [6] for 2D natural
images; we bring it to the medical image analysis field, extending it
to the M-dimensional case. The standard SSLIC framework [13] only requires
the number of superpixels and a single parameter m. In our adapted version,
each supervoxel $S_i$ has its own parameter $m_i$ setting its shape regularity
(i.e. compactness). This parameter is computed according to the mean feature
(luminance in our case) variance per supervoxel across modalities:


$$m_i = m \cdot \exp\left(\frac{\overline{\sigma^2(F_{mod})}}{\epsilon}\right) \qquad (1)$$


where $\sigma^2(F_{mod})$ is the luminance variance within the supervoxel $S_i$ in a modality,
$\overline{\,\cdot\,}$ is the mean operator, and $\epsilon$ is a scaling parameter. At the output of this step,
we have an oversegmentation of our 3D multimodal volume $I$.
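
A minimal sketch of the variance-adaptive compactness follows, assuming the exponent is the mean per-modality variance divided by the scaling parameter. The second function shows how such a compactness would enter a standard SLIC-style distance; that distance form is the classical SLIC formulation and is an assumption here, since the exact SSLIC distance is given in [13]. All function names are illustrative.

```python
import math


def variance(values):
    """Population variance of a list of intensities."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def adaptive_compactness(m, voxels_per_modality, eps):
    """Compactness of one supervoxel from its mean per-modality variance.

    voxels_per_modality: one list of voxel intensities per modality.
    """
    variances = [variance(v) for v in voxels_per_modality]
    return m * math.exp(sum(variances) / len(variances) / eps)


def slic_style_distance(feat_a, feat_b, pos_a, pos_b, m_i, step):
    """Classical SLIC-style distance combining feature and spatial terms.

    A larger m_i gives more weight to spatial proximity, which makes
    the supervoxel more regular (assumed form, see the lead-in).
    """
    d_feat = math.dist(feat_a, feat_b)
    d_space = math.dist(pos_a, pos_b)
    return math.sqrt(d_feat ** 2 + (m_i * d_space / step) ** 2)


# A flat supervoxel keeps the base compactness; a textured one is regularized more.
flat = adaptive_compactness(0.1, [[0.5, 0.5], [0.2, 0.2]], eps=0.01)
textured = adaptive_compactness(0.1, [[0.0, 0.2], [0.1, 0.1]], eps=0.01)
```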


3.2 Supervoxels Merging 


The oversegmentation produced by the supervoxel-based method already reduces 
some redundant information. However, the SSLIC approach is sensitive to the
seed initialization, which constrains the final number of clusters. Flat objects
in the image, such as tumors exhibiting low texture and small intensity variation,
are still composed of redundant supervoxels. With the aim of further reducing
the redundancy, we use a method inspired by the work of Fu et al. [5] and
apply it in the context of multimodal MRI. The oversegmentation is transformed
into a Region Adjacency Graph (RAG) $G = \{\mathcal{V}, \mathcal{E}\}$, with the set of vertices
$\mathcal{V} = \{v_1, v_2, \ldots, v_n\}$ and $n$ the number of supervoxels. Edges $\mathcal{E}$ represent connections
between adjacent supervoxels and their weights denote the dissimilarity based
on the intensity and texture features. The dissimilarity of two supervoxels $i$ and
$j$, named $w_{i,j}$, is defined in Eq. 2.


$$w_{i,j} = \exp\left(\frac{\left(\alpha D_c(i,j) + \beta D_t(i,j)\right)^2}{\gamma}\right) \qquad (2)$$


where $D_c(i,j)$ and $D_t(i,j)$ are the intensity and texture dissimilarities, $\alpha$ and
$\beta$ their respective adjustable weights, and $\gamma$ governs how close to each other
features are. More specifically,


$$D_c(i,j) = \sum_{mod=1}^{M} \Delta Y_{mod}(i,j), \qquad (3)$$

where $\Delta Y_{mod}(i,j) = (Y^{i}_{mod} - Y^{j}_{mod})^2$ and $Y^{i}_{mod}$, $Y^{j}_{mod}$ are the average luminance
values in the $i^{th}$ and $j^{th}$ supervoxels, respectively. $D_t(i,j)$ is the texture
dissimilarity computed in [5] as:


$$D_t(i,j) = \sum_{mod=1}^{M} \Delta H_{mod}(i,j), \qquad (4)$$


where $\Delta H_{mod}(i,j)$ is the Manhattan distance between the histograms of supervoxels
$i$ and $j$ as in [5]. The distance measures were normalized in a range of [0; 1]
to be efficiently combined. Some brain tissues, such as tumors, have low
texture and high intensity, which can result in an imbalance between intensity
and texture features. Because of these complex cases, the adjustable weights
from Eq. 2 were manually adjusted to better separate normal and tumor
tissues, as described in Sect. 4.3. Once the dissimilarity measures
over supervoxels and graph weights are computed, the supervoxel merging algo- 
rithm takes place to reduce information redundancy and achieve finer clustering. 
However, the Region Adjacency Graph (RAG) connects each supervoxel to all 
its neighbors. So, it is very computationally expensive to directly start merging 
the nodes with high similarity since the number of edges and nodes is still too 
large. To accelerate the merging process, a Nearest Neighbor Graph (NNG) [17] 
is determined based on the RAG. The NNG efficiently determines paired super- 
voxels that are the most similar. Here, the NNG is calculated using the Kruskal 
algorithm [11], which significantly reduces the number of edges and overall the 
search space, allowing for a more computationally efficient merging. The merging
algorithm is applied iteratively until no edge in the NNG has a weight
below a given threshold $\tau$, defined in Eq. 5:


$$\tau = \frac{\sum_{i} \left(\min e_i - \sigma(e_i)\right)}{n} \qquad (5)$$


with $e_i$ one of the edges connected to supervoxel $i$, that is, $e_i \in \{w_{i,j}\}$, $j \in N_i$,
and $\sigma$ denotes the standard deviation.
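
The quantities used during merging can be sketched in a few lines. The code below is illustrative only: it takes the edge weights as given, reads the threshold as the mean over supervoxels of the smallest incident weight minus the standard deviation of the incident weights (population standard deviation is an assumption), and builds the nearest neighbor graph by keeping each supervoxel's least dissimilar edge, a simple stand-in for the Kruskal-based construction mentioned in the text. All names are hypothetical.

```python
import statistics


def intensity_dissimilarity(means_i, means_j):
    """Sum over modalities of squared differences of mean luminance."""
    return sum((a - b) ** 2 for a, b in zip(means_i, means_j))


def texture_dissimilarity(hists_i, hists_j):
    """Sum over modalities of Manhattan distances between histograms."""
    return sum(sum(abs(a - b) for a, b in zip(h_i, h_j))
               for h_i, h_j in zip(hists_i, hists_j))


def incident_weights(weights, i):
    """Weights of all edges touching supervoxel i."""
    return [w for edge, w in weights.items() if i in edge]


def merge_threshold(weights, n):
    """Mean over supervoxels of (min incident weight - std of incident weights).

    Assumes every supervoxel has at least one neighbor.
    """
    total = 0.0
    for i in range(n):
        e_i = incident_weights(weights, i)
        total += min(e_i) - statistics.pstdev(e_i)
    return total / n


def nearest_neighbor_graph(weights, n):
    """Keep, for each supervoxel, only the edge to its most similar neighbor."""
    nng = {}
    for i in range(n):
        incident = {e: w for e, w in weights.items() if i in e}
        best = min(incident, key=incident.get)
        nng[best] = incident[best]
    return nng


# Three mutually adjacent supervoxels with hypothetical edge weights.
weights = {(0, 1): 0.2, (0, 2): 0.6, (1, 2): 0.4}
tau = merge_threshold(weights, n=3)
nng = nearest_neighbor_graph(weights, n=3)
to_merge = [edge for edge, w in nng.items() if w < tau]
```

In an iterative scheme, the pairs in `to_merge` would be fused, the affected weights recomputed, and the procedure repeated until the list is empty.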


4 Experiments 


4.1 Experimental Setup 


Experiments are performed on the publicly available multimodal BraTS 2020 
dataset, which is a standard brain tumor segmentation benchmark [16]. The 
dataset is composed of real brain MRI exams including T1, T1CE, T2, and
FLAIR sequences, acquired from 19 institutions for 369 subjects. The ground
truth is provided for each exam in the form of contours manually delineated by
experts. Three tumor subregions were annotated: contrast-enhancing tumor,
non-enhancing/necrosis combined, and edema. Images are 3D volumes with a
size of [155 x 240 x 240] (DxWxH) and an isotropic resolution of 1 mm. The
sequences from the dataset are co-registered to the same anatomical template and
skull-stripped by the BraTS maintainers. Images are cropped to remove the
background area at the edges and normalized independently for each modality
to [0; 1].
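
Independent per-modality normalization can be sketched as below. The text does not state which rescaling is used, so min-max scaling is an assumption of this example, as are the function names and the flat-list representation of each volume.

```python
def min_max_normalize(volume):
    """Rescale one modality, given as a flat list of intensities, to [0, 1]."""
    lo, hi = min(volume), max(volume)
    if hi == lo:
        # A constant volume carries no contrast; map it to zeros.
        return [0.0 for _ in volume]
    return [(v - lo) / (hi - lo) for v in volume]


def normalize_modalities(modalities):
    """Normalize every modality on its own, so intensity ranges become comparable."""
    return [min_max_normalize(volume) for volume in modalities]


# Two modalities with very different raw intensity ranges.
normalized = normalize_modalities([[0, 50, 100], [10, 20, 30]])
```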


4.2 Quality Assessment Methods 


We use several reference (using ground-truth) and no-reference segmentation
assessment metrics to evaluate the performance of the proposed unsupervised
segmentation method in delineating tumor tissues and preserving meaningful voxel
disparities. The Achievable Segmentation Accuracy (ASA) score is computed
in the tumor’s region to assess the accuracy of the supervoxel boundaries
with respect to the ground truth. The wVar and Moran’s Index (MI) quantify
the disparity within and between clusters, respectively. More precisely, the wVar
assesses the luminance disparity within each cluster, while MI is a spatial
autocorrelation measure characterizing the degree of similarity among supervoxels.
Since the SSLIC oversegmentation is highly redundant, MI is an effective
measure to show the advantage of the merging approach. The best value for
wVar and MI is 0, which indicates the absence of redundancy. The Global Score
(GS) is defined as the average of wVar and MI and is used as a final metric
with ASA. We also use the number of supervoxels in the image (Supervoxel
count) to quantify the improvement brought by the merging algorithm. For the
no-reference metrics, in the monomodal setting, the final results are computed
as an average over all modalities for all subjects. In the multimodal setting,
the final results correspond to the average across all subjects. Since the wVar and
MI scores provide one measure per modality, we keep the minimal value for each
supervoxel across modalities. The evaluation is done in this way to highlight
the discriminative power of the different modalities. The other scores (ASA and
count) directly provide a single measurement per subject.
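
As an illustration of the reference metric, the sketch below computes ASA in its usual form: every supervoxel is assigned the ground-truth label it overlaps most, and the score is the fraction of voxels that this best-case labeling gets right. The restriction to the tumor region described above is not reproduced, and the function name is illustrative.

```python
from collections import Counter, defaultdict


def achievable_segmentation_accuracy(supervoxels, ground_truth):
    """Upper bound on labeling accuracy attainable with a given oversegmentation.

    Both arguments are flat, equally long lists of integer labels, one per voxel.
    """
    overlap = defaultdict(Counter)
    for sv, gt in zip(supervoxels, ground_truth):
        overlap[sv][gt] += 1
    # Each supervoxel contributes its largest overlap with any ground-truth class.
    best = sum(max(counts.values()) for counts in overlap.values())
    return best / len(supervoxels)


supervoxels = [0, 0, 0, 1, 1, 1]
ground_truth = [0, 0, 1, 1, 1, 1]
asa = achievable_segmentation_accuracy(supervoxels, ground_truth)  # 5 of 6 voxels
```

A supervoxel that straddles a tumor boundary lowers the score, which is why ASA measures boundary adherence.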


4.3 Implementation Details 


The SSLIC and merging algorithms depend on input parameters. The quality of the
output clustering with SSLIC depends on the parameters K and m. K controls the
number of supervoxels; in our case it is specified through the smallest desired
isotropic supervoxel size, K = [10, 10, 10]. As multimodal images are normalized
independently between [0, 1], the compactness factor m is set to 0.1. This value
better balances intensity and spatial features, as spatial features are not
normalized to the range [0, 1]. The variance parameter ε, used to balance the
influence of the variance on the local compactness, is set to 0.01. The
hyperparameters α, β, γ have been empirically set to 0.5, 0.1, and 0.1 to
balance feature importance. Several orders of magnitude were tested, and the
parameter set with the highest ASA was retained. The parameters used in the
histogram texture similarity are set to 32 for the number of bins, 8 for the
number of angles, and 10 for the histogram bin size. The whole process takes
around 40 s for an image of shape [4 × 155 × 240 × 240], with the first axis
corresponding to the number of modalities M. The SSLIC algorithm and the feature
extraction were computed on 12 threads with 32 GB of memory.
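
To give a sense of scale for the chosen supervoxel size, the following sketch
counts the cells of a regular 10 × 10 × 10 grid laid over one 155 × 240 × 240
volume. That SSLIC seeds its clusters on such a grid is an assumption made here
for illustration (it is how SLIC-type methods are usually initialised), and
`grid_seed_count` is a hypothetical helper.

```python
import math


def grid_seed_count(shape, size):
    """Number of cells of a regular grid with the given cell size per axis."""
    return math.prod(math.ceil(n / s) for n, s in zip(shape, size))


# One modality of the image described above, with 10-voxel isotropic cells.
n_seeds = grid_seed_count((155, 240, 240), (10, 10, 10))
```

This gives 16 × 24 × 24 = 9216 cells, the same order of magnitude as the
supervoxel counts reported before merging in Table 1.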


4.4 Experimental Results 


In our experiments, we assess the benefit of exploiting multimodal information 
in computing supervoxels, the effectiveness of including variance as a regularity 
coefficient in the SSLIC and the impact of the merging algorithm relying on 

Fig. 1. The first column is an axial cross-section over 3 MRI sequences: T1 (A),
T1-CE (B), T2 (C). The second column (D, E, and F) shows the supervoxels
computed using Mono_SSLIC on the 3 modalities independently. The third column
(G, H, and I) corresponds to the result of the merging procedure applied to the
previously computed supervoxels on each modality (Mono_SSLICMerged). In the last
column, J corresponds to the resulting segmentation of Multi_SSLIC computed on
the 3D volume I composed of the different modalities, K is the result of SSLIC
computed on I with the local regularity coefficient (Multi_SSLICVar), and L is
the proposed method including multimodal SSLIC followed by the merging procedure
with the local regularity coefficient (Multi_SSLICVar_Merged). The ground-truth
overlay is represented by green, red, and yellow (edema, necrosis, and active
tumor). (Color figure online)


Fig. 2. (Top) Original multimodal images zoomed in around the tumor region.
Modalities are T1 (A), T1CE (B), T2 (C), and the ground truth (D). (Bottom)
Multi_SSLIC, Multi_SSLICMerged, Multi_SSLICVar and Multi_SSLICVar_Merged (E-H).
Blue and red squares show the influence of the local adaptive regularity on
supervoxel homogeneity and compactness. The ground-truth overlay is represented
by green, red, and yellow (edema, necrosis, and active tumor). (Color figure
online)

Table 1. Performance measurements computed with our own implementation of the
scores added to the superpixel benchmark [24]

Monomodal method        ASA    wVar   MI     GS     Supervoxel count
Mono_SSLIC              .625   .314   .398   .356   8629.770
Mono_SSLICVar           .676   .346   .362   .354   8233.930
Mono_SSLICMerged        .290   .314   .220   .267   204.625
Mono_SSLICVar_Merged    .316   .348   .318   .333   203.904

Multimodal method       ASA    wVar   MI     GS     Supervoxel count
Multi_SSLIC             .687   .493   .417   .455   8626.370
Multi_SSLICVar          .648   .505   .383   .444   8163.21
Multi_SSLICMerged       .673   .483   .337   .409   301.417
Multi_SSLICVar_Merged   .712   .458   .349   .403   298.42


color and texture features on the segmentation accuracy. To this end, we compare
4 unsupervised segmentation methods applied in both the monomodal and the
multimodal settings: SSLIC applied without (SSLIC) or with (SSLICVar) the
adaptive local variance regularity coefficient, and SSLIC followed by the
merging step without (SSLICMerged) or with (SSLICVar_Merged; ours) the adaptive
local variance regularity coefficient. These methods are applied in both
monomodal (Mono) and multimodal (Multi) settings.

Figures 1 and 2 show some qualitative results of applying the 4 segmentation
methods to one subject with 4 modalities: FLAIR, T1, T1CE, and T2. To further
illustrate the performance of the proposed approaches, we report in Table 1
several quality metrics computed on the segmentations obtained in both the
monomodal and multimodal settings.


The Benefit of Multimodality. As depicted in Fig. 1 J, applying the segmentation
to multimodal images successfully takes into account the heterogeneous
information from different modalities to cluster the image. In contrast, in
Fig. 1 D-F (results generated from Mono_SSLIC), the clusters do not adhere
completely to the ground-truth tumor boundaries on the T1 and T2 modalities,
since the complete information concerning the tumor is not fully present and
multimodal information cannot be efficiently exploited. In Fig. 1 G-I, we show
the results of the merging applied independently on the three modalities with
ground-truth overlay. It is clear that the T2 modality gives more information
about edema tissue, whereas T1CE further characterizes the tumor tissue. A more
accurate clustering of the tumor can be seen in Fig. 1 J-L.

As shown in Table 1, the multimodal approaches, i.e., Multi_SSLICVar and
Multi_SSLICVar_Merged, perform better in terms of ASA compared to the monomodal
approaches. Multimodal clustering exploits all the available information from
different modalities and produces an accurate segmentation. We found that the
best performing approach is Multi_SSLICVar_Merged, which improves the clustering
accuracy by 5.2% for the ASA score and 25% for the GS with multimodal
information. Indeed, all modalities give different complementary information
about tissues. Thereby, using all available information to merge supervoxels
while keeping important tissue properties, such as tumor texture, improves
qualitative results as well as ASA and GS scores.


Impact of Locally Adapting the Superpixel Regularity. Including variance inside
the SSLIC algorithm allows the regularity coefficient to be adapted
automatically to highly textured and high-intensity supervoxels without manually
adapting m. This makes the supervoxels more homogeneous as well as more compact,
resulting in a better final clustering accuracy, as shown in Fig. 2. Blue and
red squares in Fig. 2 F and H (Multi_SSLICMerged and Multi_SSLICVar_Merged) show
the influence of using the local regularity coefficient on the compactness of
the merged supervoxels. The resulting supervoxels are more compact and differ
from their neighbors. We can see in the red square of Fig. 2 H that supervoxels
have been correctly computed with more compactness and have been merged into a
bigger supervoxel. Furthermore, from the quantitative results in Table 1, we can
see that the local adaptive regularity coefficient (Var) improves the results in
terms of accuracy (ASA) and GS for the methods applied in both monomodal and
multimodal settings (except for Multi_SSLIC and Multi_SSLICVar). The variance of
the supervoxel is an important factor to take into account in the segmentation
algorithm. The MI is almost the same for both Multi_SSLICMerged and
Multi_SSLICVar_Merged, demonstrating the robustness of the merging step to the
disparity in variance across supervoxels.


Performance of the Merging Algorithm. In the monomodal setting, in a modality
where tumor tissues are not distinct, merging similar neighboring supervoxels
reduces the tumor boundary accuracy. For example, in Fig. 1 H, supervoxels
computed independently on the T1CE modality are not accurately merged, since
this modality highlights only the active tumor while other tumor tissues are not
visible. This results in a poor ASA score for T1CE, therefore penalizing the
final average ASA score. As such, computing the average ASA across modalities
highlights the lack of multimodal discriminant power (i.e., of making use of
the tumor parts visible in all modalities). The merging approach applied in the
multimodal setting is capable of reducing the number of supervoxels by a factor
of 35 (column "Supervoxel count" in Table 1) and decreasing the redundancy (MI)
by 0.21% on average compared to the initial oversegmentation. The texture
homogeneity inside the merged supervoxels has been preserved, which demonstrates
that our algorithm merges similar supervoxels. It is also interesting to note
that the wVar obtained with Mono_SSLIC and with Mono_SSLICMerged is
approximately the same. This can be explained by the fact that the clustering
was initially correct for the Mono_SSLIC step without merging.


5 Conclusion 


In this work, we proposed a novel approach for merging supervoxels in a
multimodal setting towards brain tumor classification. We showed that our
methods applied to multimodal images are capable of exploiting the
complementarity between different modalities, producing more accurate clusters
than traditional monomodal approaches. Our approach Multi_SSLICVar_Merged
improved the clustering accuracy by 5.2% for the ASA score and 25% for the GS.
The redundancy of supervoxels is also reduced by a factor of 35, decreasing the
computational time and making the resulting oversegmentation more suitable to
be combined with a neural network classifier. Several open questions remain to
be tackled in future work. First, one drawback of the proposed approach is its
dependency on prior registration of multiple modalities. Bipartite Graph
Matching [21] seems to be an efficient way to alleviate this constraint.
Moreover, taking into account radiomics and deep features in the computation of
the supervoxels could also improve the adherence of the initial
over-segmentation or merged supervoxels to contrasted tissues, therefore
resulting in a more homogeneous final clustering.

Abstract. The problem of tumor growth prediction is challenging, but promising
results have been achieved with both model-driven and statistical methods. In
this work, we present a framework for the evaluation of growth predictions that
focuses on the spatial infiltration patterns, and specifically on evaluating a
prediction of future growth. We propose to frame the problem as a ranking
problem rather than a segmentation problem. Using the average precision as a
metric, we can evaluate the results with segmentations while using the full
spatiotemporal prediction. Furthermore, by applying a biophysical tumor growth
model to 21 patient cases, we compare two schemes for fitting and evaluating
predictions. By carefully designing a scheme that separates the prediction from
the observations used for fitting the model, we show that a better fit of model
parameters does not guarantee a better predictive power.


Keywords: Glioma · Growth model · Validation · Magnetic resonance imaging ·
Brain


1 Introduction 


As the diagnosis and delineation of glioma have improved with machine learning
[4], researchers look towards the more challenging task of predicting the
disease trajectory into the future [8,19]. However, the problem of tumor growth
is challenging in many ways, not just because of the lack of publicly available
data. The variables of clinical importance, such as the speed of infiltration
and proliferation, are unknown and the problem of estimating them from
observations is ill-posed. Furthermore, the observations we do have are flawed,
as tumor cells are known to spread beyond the visible boundary on MR imaging
[22].

Despite these challenges, biophysical growth models have shown promise in 
their ability to predict the spatial growth patterns for individual cases. They 
are model-driven and strongly rooted in a mechanistic understanding of tumor 
growth. Delineations of the tumor on MR imaging typically form the input for 
individual model fitting, with follow-up imaging providing the gold standard of
evaluation. Though other methods of evaluation exist, such as biopsy samples 
[10] or PET imaging [18], for most clinical cases consecutive delineations are the 
best approximation for a ground truth. 

Due to the nature of the data, growth predictions are often framed as a
segmentation problem, for example by using an overlap metric such as the Dice
Similarity Coefficient based on a sample in time [7,19]. Although this metric
comes naturally with the ground-truth data, it is less representative of the
underlying problem. The main disadvantage of overlap-based metrics is that they
treat all voxels equally, while some errors are more significant than others.
Intuitively, we would want to assign more significance to false negative
predictions at a large distance from the predicted tumor boundary, as they
represent a larger disagreement with the model and would likely require a large
adjustment to predict correctly. This intuition is represented in metrics based
on the segmentation boundary, such as the symmetric surface distance used in
Konukoglu et al. [17]. But even a distance metric compares only to a single
point in time, and using a boundary metric becomes less appropriate when the
ground truth contains new disconnected lesions.

Another challenge in the evaluation of tumor growth predictions is the
entanglement of model fit and prediction. All tumor growth models require an
initial observation to fit model parameters. The goodness-of-fit is measured
using the segmentation on this initial observation and the prediction is
performed from the time of onset, through the initial observation towards the
future [3]. The optimization of this inverse problem is an important topic for
research, not least because the growth parameters can be of prognostic value by
themselves [21], but often these methods are evaluated on simulated data. The
clinical reality will not adhere to the strict assumptions made in the model,
and therefore the predictive value of the model depends not only on the
effectiveness of the model fitting but also on the correctness of the
assumptions.

An ideal test of a prediction model would require a strict separation of model 
fitting and evaluation. However, in the problem of personalized tumor growth 
models this separation is not strictly possible because the initial condition used 
for the parameter fit is also part of the final tumor shape used for evaluation. 
Especially with models that simulate the full growth trajectory, there is a risk 
that model fit on the initial condition is strongly entangled with the prediction of 
growth. After all, if the shape of the initial lesion is not estimated correctly then 
this error will propagate to the estimation of the future disease trajectory. This
work explores the distinction between goodness-of-fit at the initial time-point, 
and predictive performance for future time-points by comparing two temporal 
evaluation schemes, one of which aims to strictly separate the initial condition 
from the predicted growth behavior. 

This work makes the following contributions:


1. A novel framing of tumor growth as a ranking problem, with the Average
Precision as the performance metric.
2. The application of this evaluation framework to a biophysical tumor growth
model and a dataset of 21 patient cases, to explore the relation between
goodness-of-fit at the initial time-point and predictive performance for future
time-points.


2 Methods 


2.1 Tumor Growth as a Ranking Problem 


In this section, we propose that tumor growth prediction could be framed as a 
ranking problem, aimed at predicting the relative time-to-invasion of each voxel 
in the brain. Based on this perspective, we propose an evaluation metric for 
assessing the quality of the predictions (i.e., rankings) resulting from any growth 
model. This problem formulation is aimed at predicting infiltrative growth in a 
spatial sense, and simplifies the problem by disregarding the speed of growth 
and potential mass effect. 

We assume that a growth model could produce a segmentation of the tumor 
S(t) at any time t > 0. It may therefore assign to every location in the brain a 
time T(x), which is the first time t when the tumor reaches that location. As we 
do not require an accurate estimation of the growth speed, we require only that 
the estimated T(x) is a ranking of voxels in the brain, such that: 


$$T(x_a) > T(x_b) \iff \exists t : x_a \notin S(t),\ x_b \in S(t). \tag{1}$$


The ranking can be evaluated by a sampling of the ground-truth segmen- 
tation S’, by using the Average Precision (AP). The AP is defined as the area 
under the Precision-Recall (PR) curve: 


$$AP = \sum_t \big(R(t) - R(t-1)\big)\, P(t), \tag{2}$$


where R(t) and P(t) are the recall and precision at a threshold t on the time-to- 
invasion ranking T, leading to the predicted segmentation S(t) = {x : T(x) < t}, 
and comparing to the reference segmentation S’: 


$$P(t) = \frac{|S(t) \cap S'|}{|S(t)|}, \qquad R(t) = \frac{|S(t) \cap S'|}{|S'|}. \tag{3}$$


The AP metric weighs the precision scores by the difference in recall, so that
all tumor volume predictions S(t) are taken into account, from the tumor onset
to the time when the recall is 1, which is when the ground-truth segmentation
is completely encompassed by the prediction S(t). An evaluation based on a
single time t would represent a single point on the PR curve. If we take a
volume-based sample, where the estimated tumor volume equals the observed tumor
volume, i.e. |S(t)| = |S'|, this is the time t where R(t) = P(t).
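
The definitions of precision, recall and AP given above translate into a few
lines of NumPy. The sketch below is one possible reading, not the authors'
implementation: it takes one threshold per distinct value of T, treats a voxel
as predicted once its T value is less than or equal to that threshold, and
assumes that the reference segmentation is not empty. The names `pr_curve` and
`average_precision` are illustrative.

```python
import numpy as np


def pr_curve(T, S_ref):
    """Precision and recall at every distinct threshold of the ranking T."""
    T = np.asarray(T, dtype=float).ravel()
    S_ref = np.asarray(S_ref, dtype=bool).ravel()
    order = np.argsort(T, kind="stable")
    T_sorted = T[order]
    hits = np.cumsum(S_ref[order])  # true positives among the first k voxels
    # Index of the last voxel in each run of tied T values.
    last = np.flatnonzero(np.r_[T_sorted[1:] != T_sorted[:-1], True])
    tp = hits[last]
    precision = tp / (last + 1)     # |S(t) ∩ S'| / |S(t)|
    recall = tp / S_ref.sum()       # |S(t) ∩ S'| / |S'|
    return precision, recall


def average_precision(T, S_ref):
    """Sum of (R(t) - R(t-1)) * P(t) over thresholds, with R = 0 before onset."""
    precision, recall = pr_curve(T, S_ref)
    return float(np.sum(np.diff(np.r_[0.0, recall]) * precision))


T = np.array([1.0, 2.0, 3.0, 4.0])  # toy time-to-invasion ranking of 4 voxels
ap_perfect = average_precision(T, [1, 1, 0, 0])   # reference voxels ranked first
ap_reversed = average_precision(T, [0, 0, 1, 1])  # reference voxels ranked last
```

Only the order of the T values matters here: rescaling the time axis with any
increasing function leaves the AP unchanged, which reflects the fact that the
ranking formulation disregards the speed of growth.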

Formulating the problem as a ranking and using the AP has a number of
qualitative advantages. First, the ranking T has a direct local connection to
the speed of the tumor boundary. If the ranking is smooth, the gradient of T
represents the local movement of the visible tumor boundary. The AP also
automatically assigns a larger weight to certain parts of the prediction,
depending on the assigned ranking T, regardless of any assumptions on the
significance of distance in space or time. We might quantify the agreement
between T and S' locally by using the rank of the voxel T(x) as a threshold on
the PR curve. A local prediction T(x) is in agreement with S' if it is part of
the ground-truth segmentation (x ∈ S') and can be included with high precision
P(T(x)), or else if it falls outside S' but can be excluded with high recall
R(T(x)). Figure 1 illustrates the computation of the AP metric and this local
measure of disagreement.
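
The local measure described in this paragraph can be sketched as a voxel-wise
map: the precision at the voxel's own threshold for voxels inside the reference
segmentation, and the recall at that threshold for voxels outside it. This is a
minimal illustration under the same assumptions as before (inclusive
thresholds, non-empty reference); `local_agreement` is a hypothetical name.

```python
import numpy as np


def local_agreement(T, S_ref):
    """P(T(x)) for voxels inside S_ref and R(T(x)) for voxels outside it."""
    T = np.asarray(T, dtype=float)
    inside = np.asarray(S_ref, dtype=bool).ravel()
    # One threshold per distinct T value; `inverse` maps voxels to thresholds.
    values, inverse = np.unique(T.ravel(), return_inverse=True)
    n_pred = np.cumsum(np.bincount(inverse, minlength=values.size))
    tp = np.cumsum(
        np.bincount(inverse, weights=inside.astype(float), minlength=values.size)
    )
    precision = tp / n_pred
    recall = tp / inside.sum()
    agreement = np.where(inside, precision[inverse], recall[inverse])
    return agreement.reshape(T.shape)


# Toy ranking in which the first-ranked voxel lies outside the reference.
agreement = local_agreement([1.0, 2.0, 3.0, 4.0], [0, 1, 1, 0])
```

Low values mark the voxels where the ranking disagrees most with the
reference: in the toy case the first-ranked voxel, which is predicted to be
invaded earliest but lies outside the reference, receives the lowest value.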

Fig. 1. Left: cross-section of tissue segmentation of a specific case with
thresholds on the T map, generated by a tumor growth model, indicated as
segmentation boundaries. The ground-truth segmentation S' is indicated by a red
overlay. Middle: corresponding Precision-Recall curve with the same thresholds
indicated. The sample with a corresponding volume is marked on the PR curve.
Right: quantification of agreement by R(T(x)) for voxels outside S' and P(T(x))
for voxels inside S'.


2.2 Example Growth Model 


To illustrate the proposed framework for evaluating tumor growth predictions, a
traditional diffusion-proliferation model was used with anisotropic diffusion,
informed by diffusion tensor imaging (DTI). This model is intended to
illustrate the use of the evaluation framework, but it is not our aim to present
a novel or improved growth model. The model is defined by a partial differential
equation for the cell density c, which changes with each timestep dt according
to:

$$\frac{dc}{dt} = \nabla \cdot (D \nabla c) + \rho c (1 - c), \tag{4}$$

$$D \nabla c \cdot n_{\partial\Omega} = 0, \tag{5}$$

where ρ is the growth factor, n_∂Ω is the normal vector at the boundary between
the brain and CSF, and D is a tensor comprising an isotropic and anisotropic
component:

$$D = \kappa(x)\, \mathbf{I} + \tau F(x)\, \mathbf{T}(x), \tag{6}$$

where κ and τ are parameters to weigh the two components, **I** is the identity
matrix, F(x) is the local Fractional Anisotropy (FA) and **T** is the normalized
diffusion tensor [11].

The isotropic diffusion depends on the local tissue type [14], as defined by
separate parameters κ_w and κ_g for voxels in the white matter (W) and grey
matter (G), respectively:

$$\kappa(x) = \begin{cases} \kappa_w & x \in W \\ \kappa_g & x \in G \end{cases}$$


To go from a prediction of c(t, x) to a time-to-invasion ranking T(x), a
threshold c_v is applied at each iteration such that
T(x) = min{t : c(t, x) > c_v}, where the visibility threshold is set as
c_v = 0.5. The initial condition of the model is provided by an initial cell
density c(t = 0), which can be defined in two ways: 1) as a Gaussian
distribution centered at a location x_s with a standard deviation of 1 mm;
2) based on a segmentation, by setting the cell density at c = c_v for voxels
inside the segmentation [7].
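
How a density simulation turns into a time-to-invasion map can be sketched on a
small 2D grid. The code below is deliberately simplified and is not the model
used in this work: it uses an explicit Euler finite-difference scheme with a
single scalar diffusion coefficient `kappa`, instead of a finite element
discretisation with anisotropic, tissue-dependent diffusion. The grid size and
the parameter values are chosen only so that the toy example runs quickly.

```python
import numpy as np


def simulate_time_to_invasion(c0, kappa, rho, c_v=0.5, dt=0.1, n_steps=400):
    """Explicit Euler steps of dc/dt = kappa * laplace(c) + rho * c * (1 - c).

    Returns T, the first simulated time at which c exceeds c_v in each pixel
    (np.inf where the threshold is never reached), and the final density.
    Stability on a unit grid in 2D requires dt <= 1 / (4 * kappa).
    """
    c = np.asarray(c0, dtype=float).copy()
    T = np.where(c > c_v, 0.0, np.inf)
    for step in range(1, n_steps + 1):
        p = np.pad(c, 1, mode="edge")  # replicated border: zero-flux boundary
        laplace = p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * c
        c = c + dt * (kappa * laplace + rho * c * (1.0 - c))
        newly_visible = (c > c_v) & np.isinf(T)
        T[newly_visible] = step * dt
    return T, c


# Gaussian seed with a standard deviation of one pixel at the grid centre.
yy, xx = np.mgrid[:21, :21]
c0 = np.exp(-((yy - 10) ** 2 + (xx - 10) ** 2) / 2.0)
T, c_final = simulate_time_to_invasion(c0, kappa=0.5, rho=1.0)
```

The resulting array `T` plays the role of the ranking T(x): pixels close to the
seed receive small values and pixels further away receive larger ones.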

The model was implemented in FEniCS [1] in a cubic mesh of 1 mm isotropic
cells, using a finite element approach and Crank-Nicolson approximations for
the time stepping. It has four unknown parameters (ρ, τ, κ_w, κ_g) and, in case
of the first approach for setting c(t = 0), an initial location x_s. The method
for fitting x_s is explained below.


Fit of Initial Point. A fit of the point x_s is essential for the model
initialization from tumor onset, and its location depends on the model
parameters. Konukoglu et al. [17] have shown that an eikonal approximation can
effectively mimic the evolution of the visible tumor boundary. In this work, we
use an eikonal approximation that assumes the visible tumor margin moves at a
speed v = 4√(ρ Tr(D)), in order to estimate x_s for a given set of model
parameters, by optimising the approximation of the initial tumor S_0 in terms of
the Dice overlap at equal volume using Powell's method [20]. To be more robust
to the optimization seed, considering that the optimization landscape may have
multiple local optima, the optimization was repeated for ten runs with different
random seeds to increase the chance of finding the global optimum for x_s.
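
The objective of this fit, the Dice overlap at equal volume, can be sketched as
follows. The function assumes that an arrival-time map T has already been
computed for a candidate seed location (for instance with an eikonal solver,
which is not shown); `dice_at_equal_volume` is a hypothetical name and the
arrays are toy data.

```python
import numpy as np


def dice_at_equal_volume(T, S0):
    """Dice overlap between S0 and the |S0| voxels with the smallest T."""
    T = np.asarray(T, dtype=float).ravel()
    S0 = np.asarray(S0, dtype=bool).ravel()
    volume = int(S0.sum())
    # Threshold the arrival times so that the prediction has the volume of S0.
    predicted = np.zeros_like(S0)
    predicted[np.argsort(T, kind="stable")[:volume]] = True
    overlap = np.count_nonzero(predicted & S0)
    return 2.0 * overlap / (predicted.sum() + volume)


# Four voxels, two of which belong to the initial tumor.
dice = dice_at_equal_volume([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0])
```

In a full fit, a function of this kind would be evaluated for each candidate
x_s inside a Powell optimiser, for instance `scipy.optimize.minimize` with
`method="Powell"` applied to the negative Dice score. Because both sets have the
same size, the Dice score at equal volume coincides with the precision and the
recall at that threshold.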


3 Experiments 


3.1 Dataset 


A retrospective dataset was selected from Erasmus MC of patients who a) were 
diagnosed with a low-grade glioma; b) were treated with surgical resection, but 
received no chemo- or radiotherapy; and c) had a DTI and 3D T1-weighted scan 
before resection, and two follow-up scans (before and after tumor progression). 
This resulted in data of 21 patients, after one dataset was excluded due to failed 
registration. Note that the time difference between the measurement of initial 
tumor and the two follow-up scans varied from a few months to several years. 


3.2 Temporal Evaluation Schemes


In the typical timeline of fit and evaluation [14,17], described in Fig. 2 as
the bidirectional scheme, the model is fitted on a tumor segmentation S_0 and
then simulated from onset, through S_0, to the point of evaluation S_2. In other
words, the prediction contains the behavior that it is fitted on.

We compare this method to a strictly forward evaluation scheme that separates
the model fit from the prediction as much as possible. As described in Fig. 2 as
the forward scheme, the parameters (in this case x_s) are fitted on an initial
time-point S_0 and then used to make a prediction between two follow-up scans
S_1 and S_2. By running the prediction from a segmentation S_1 instead of an
initial location x_s, the potential error in fitting S_0 does not propagate to
the evaluation, which is based purely on the growth behavior between S_1 and S_2
that is unknown when fitting the model.


[Figure: timelines of the Bidirectional and Forward schemes along the time axis
t, with the labels Onset, Resection and Evaluation]

Fig. 2. Overview of two temporal evaluation schemes. Bidirectional: a growth
model is fitted to the initial tumor and simulated from a seed point to generate
a voxel ranking T. Forward: parameters are fitted to the initial tumor and then
the model is initialized with a segmentation S_1 obtained after resection to
generate the voxel ranking T. Images from left to right: example of tissue
segmentation with S_0 outlined, tissue segmentation with resection cavity
removed and S_1 outlined, example of final ranking T used for the evaluation
with resection cavity and S_1 removed, quantification of agreement between T(x)
and S_2.


For our dataset, we need to consider the role of the tumor resection. In both
schemes, the resection cavity, as estimated by the alignment of the tissue at
S_0 and S_1, is removed from the region of interest for evaluation. In the
forward scheme, any voxels in the segmentation S_1 are also removed from the
region of interest, leaving only the new growth visible in S_2 for evaluation.
So where the bidirectional scheme evaluates predictive performance on the
entirety of the remaining tumor, using S_0 only to initialize the location of
onset, the forward scheme evaluates purely predictive performance based on the
knowledge of S_1.
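
The difference between the two regions of interest comes down to a few mask
operations, sketched below on toy one-dimensional masks. The helper
`evaluation_region` and the array names are hypothetical; restricting the region
to a brain mask is an added assumption of this sketch.

```python
import numpy as np


def evaluation_region(brain, cavity, s1=None):
    """Voxels kept for evaluation: brain minus cavity, minus S_1 if given."""
    roi = np.asarray(brain, dtype=bool) & ~np.asarray(cavity, dtype=bool)
    if s1 is not None:  # forward scheme only
        roi &= ~np.asarray(s1, dtype=bool)
    return roi


brain = np.array([1, 1, 1, 1, 1, 0], dtype=bool)
cavity = np.array([1, 0, 0, 0, 0, 0], dtype=bool)
s1 = np.array([0, 1, 0, 0, 0, 0], dtype=bool)
roi_bidirectional = evaluation_region(brain, cavity)
roi_forward = evaluation_region(brain, cavity, s1)
```

The ranking and the reference segmentation S_2 would then both be indexed with
the chosen mask before the AP is computed, so that in the forward scheme only
voxels of potential new growth contribute to the score.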


3.3 Data Preprocessing


Running a growth model from onset requires knowledge of the underlying healthy 
tissue. Removing pathology from an image is a research problem in itself, but 
commonly a registration approach with a healthy brain - often an atlas - is used 
[5,14,18]. In this study we used the contralateral side of the brain as a
reference for healthy brain structure (similar to [6]). This is possible because in our
dataset all lesions were strictly limited to one hemisphere. Using a registration 
of the T1-weighted image with its left-right mirrored version, all segmentations 
were transferred to the contralateral healthy side of the brain. To prevent unre- 
alistic warping of the image due to image intensity changes in the tumor, while 
still capturing its mass effect, the b-spline registration was regularized with a 
bending energy penalty [16]. The weight of this penalty with the mutual infor- 
mation metric was tuned on a number of cases using visual inspection of the 
transformation. 

The model input is a segmentation of the brain, separated into white matter
(W) and gray matter (G), potentially an estimate of the local diffusion based
on Diffusion Tensor Imaging (DTI), and a binary segmentation of the tumor.
Segmentations of the brain and brain tissue were produced using HD-BET [13]
and FSL FAST [23] respectively. For the pre-operative images, which did not
include a T2W-FLAIR sequence, S_0 was segmented manually. Tumor segmentations
S_1 and S_2 for consecutive images were produced using HD-GLIO [12,15] and
corrected manually where necessary. Alignment with the space of S_0 was
achieved with a b-spline registration, which was evaluated visually. Datasets
were excluded if the registration did not produce a reasonable alignment.

As no registration or segmentation will be perfect, some inconsistencies 
remain that prevent a perfect prediction. To not punish the model unfairly, 
the voxels in S falling outside the brain were disregarded in the computation of 
the AP metric. 


3.4 Parameters 


As the variation of diffusive behavior within the brain is a defining factor for
the tumor shape, and from a single observation it is impossible to estimate all
parameters simultaneously, we kept the proliferation constant at ρ = 0.01 while
using the parameters κ_w, κ_g and τ as parameters of interest. These parameters
were not fitted but rather varied systematically, as listed in the legend of
Fig. 3. For this range of seven growth model parameter settings, the AP
performance was measured for goodness-of-fit on the baseline segmentation S_0
and predictive performance on S_2, according to the two evaluation schemes. The
relation between goodness-of-fit and predictive performance was quantified using
a patient-wise Spearman correlation across different growth model parameter
settings. The mean of the patient-wise correlation coefficients was tested for a
significant difference from zero using a one-sample t-test.
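
The statistical procedure of this paragraph can be sketched with NumPy alone.
The AP values below are invented purely for illustration (three patients and
three parameter settings instead of 21 and seven), and the helper names are
hypothetical. The sketch computes the test statistic only; the p-value would
follow from a t distribution with n - 1 degrees of freedom, as provided for
example by `scipy.stats.ttest_1samp`.

```python
import numpy as np


def average_ranks(x):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    x = np.asarray(x, dtype=float)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)
    first = last - counts + 1
    return ((first + last) / 2.0)[inverse]


def spearman(a, b):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


def one_sample_t(values, popmean=0.0):
    """t statistic for the mean of `values` against `popmean`."""
    v = np.asarray(values, dtype=float)
    return float((v.mean() - popmean) / (v.std(ddof=1) / np.sqrt(v.size)))


# Hypothetical AP values: rows are patients, columns are parameter settings.
ap_fit = np.array([[0.90, 0.80, 0.70], [0.60, 0.70, 0.50], [0.80, 0.90, 0.85]])
ap_pred = np.array([[0.50, 0.40, 0.45], [0.30, 0.35, 0.20], [0.40, 0.60, 0.50]])
r_per_patient = [spearman(f, p) for f, p in zip(ap_fit, ap_pred)]
t_stat = one_sample_t(r_per_patient)
```

Computing the correlation per patient before averaging keeps differences in
overall AP level between patients from dominating the result.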


4 Results


Figure 4 shows two examples of the model input and results, in terms of the
images used for tumor segmentation at the three timepoints, the segmentations
and their mirrored counterparts, and the results of a specific model (κ_w = 0.1,
κ_g = 0.1 and τ = 10) using both the forward and bidirectional evaluation
scheme. The local values of R(T(x)) and P(T(x)) indicate where the model results
are most in disagreement with the ground-truth segmentation S_2.

Figure 3 shows a comparison of the goodness-of-fit, which is measured by the
AP on the initial tumor segmentation S_0, and the final predictive performance
on S_2.

Fig. 3. Comparison of goodness-of-fit versus predictive performance for the two
evaluation setups. Results for the same patient on different parameter sets are
interconnected.


Comparing the performance between different growth model parameter settings, it
is clear that goodness-of-fit is generally higher and more dependent on the
model parameters than the predictive performance. Among the growth model
parameter settings, typically the best goodness-of-fit (AP on S_0) was achieved
with low diffusion (κ_w = 0.01), while the worst fit was achieved when the
difference in κ between white and gray matter was large (κ_w = 0.1, κ_g = 0.01
or κ_g = 0.02).

From the results of the bidirectional evaluation scheme, going from an initial
point through S_0 to S_2, it seems that there is a relation between the
goodness-of-fit and the predictive performance. However, this relation
disappears when using the forward evaluation scheme. These observations are
confirmed by the mean patient-wise correlation coefficients, which were 0.24
(p = 0.06) for the forward scheme and -0.03 (p = 0.76) for the bidirectional
scheme.


108 K. A. van Garderen et al. 


Fig. 4. Example of image processing results for two patients. Top row: T2W
imaging showing the initial tumor (left) and T2W FLAIR images showing the tumor
after surgery (middle) and at recurrence (right). Bottom row, left: T1W imaging
with boundary of resection cavity (cyan), S_1 (yellow) and S_2 (red). Both the
original segmentations and the mirrored segmentations are shown. Bottom, middle:
visualization of the local quantification of agreement by R(T(x)) for voxels
outside S_2 and P(T(x)) for voxels inside S_2, for one parameter setting in the
forward evaluation scheme. Bottom, right: same visualization for the
bidirectional evaluation scheme, same parameter setting. (Color figure online)


5 Discussion


This work presents a formulation of tumor growth prediction as a forward
ranking problem, and describes the Average Precision metric for its evaluation. 
By formulating the problem in this way we can evaluate the full spatiotemporal 
results, even if the observations are only snapshots in the form of a segmenta- 
tion. A further advantage is found in the direct link to local growth speed and 
quantification of the local model agreement. Though these advantages are only 
of a qualitative nature, and do not provide a direct benefit to the model itself, 
we believe it to be a useful step in the development and specifically evaluation of 
growth models. An important underlying assumption in this framework is that 
the time axis is not quantified, so the prediction does not provide information 
on the overall speed of growth or any potential mass effect. Predicting these 
factors is a highly relevant problem as well, but predicting spatial distribution,
mass effect and speed of growth together would likely require at least multiple
time-points for model fitting or additional clinical parameters. This is currently 
not feasible with the data available in clinical practice. For a model that does 
provide information on growth speed and mass effect, the AP metric could be 
combined with other metrics to separately evaluate the different factors of tumor 
growth. 

The importance of problem formulation is further illustrated with the two 
temporal evaluation schemes. Specifically for personalized tumor growth models, 
which are fitted to an initial tumor shape, this work presents an alternative 
forward scheme that separates the goodness-of-fit from the evaluation of future 
predictions. In the forward scheme, the model is initiated with a segmentation 
instead of an initial point of onset, so that errors made in fitting the initial 
tumor do not propagate to the final prediction. The aim of this scheme is to 
evaluate the predictive value of the model and its parameters separately from 
the goodness-of-fit at the initial observation. 

By comparing the bidirectional and forward evaluation schemes in a dataset 
of 21 patients, using a biophysical growth model, we show that the choice of 
evaluation greatly affects the relative performance of models. This is illustrated 
with different parameter settings of the same model, not with different models,
but with the purpose of showing the difficulty of evaluating true predictive per- 
formance in general. In this case, for our specific model and parameter settings, 
the difference in performance between parameter settings can be attributed to 
a better fit of the initial situation, and not necessarily a prediction of unseen 
behavior. We must note, however, that often the goal in tumor growth mod- 
elling is to find the model that best fits the available data on a fundamental 
level, both initially and in the future, and overfitting is not an immediate con- 
cern with strongly model-driven research. 

The dataset used in this research was a selection of patients that underwent 
surgical resection, but no radio- or chemotherapy. Although it is fair to assume 
that the diffusive behavior of the tumor is not affected during the surgery, so the 
model parameters would stay the same, the future growth pattern can be affected 
by the removal of tumor tissue. The decompression that occurs at resection also 
complicates the registration of post-operative imaging, which led to the exclusion
of one patient due to a failed registration. However, with surgical resection being 
the recommended treatment for most glioma patients, this is a complicating 
factor that is difficult to avoid in clinical datasets and in any application in 
clinical practice. 

As new methods of tumor growth prediction are developed, and even fully 
data-driven models are emerging using machine learning, comparing model per- 
formance becomes increasingly relevant. For that purpose, the framing of the 
problem is essential. Between the actual mechanisms of tumor growth and the
segmentation lie a flawed observation on MR imaging, the rather difficult
problem of segmentation and registration, and an estimate of the time horizon.
Those factors, combined with limited data and the fact that gliomas are
naturally unpredictable, are a major reason why tumor growth models have relied
heavily on simulations [9] and qualitative observations [2] for their
validation. This work is a
step towards the comparison and clinical evaluation of tumor growth predictions 
that fits their spatiotemporal nature, and allows for localized interpretation. 
Abstract. Medical image segmentation is a monotonous, time-consuming, and
costly task performed by highly skilled medical annotators. Despite adequate
training, the intra- and inter-annotator variations result in significantly
differing segmentations. If the variations
arise from the uncertainty of the segmentation task, due to poor image 
contrast, lack of expert consensus, etc., then the algorithms for automatic 
segmentation should learn to capture the annotator (dis)agreements. In 
our approach we modeled the annotator (dis)agreement by aggregat- 
ing the multi-annotator segmentations to reflect the uncertainty of the 
segmentation task and formulated the segmentation as a multi-class pixel
classification problem within the open-source convolutional neural network
architecture nnU-Net. Validation was carried out for a wide range of imaging
modalities and segmentation tasks as provided by the 2020 and 2021 QUBIQ
(Quantification of Uncertainties in Biomedical Image Quantification)
challenges. We achieved high quality segmentation results, despite a small set
of training samples, and at the time of this writing achieved an overall third
and sixth best result on the respective QUBIQ 2020 and 2021 challenge
leaderboards.


Keywords: Multi-class segmentation · Noisy labels · Uncertainty
aggregation · Convolutional neural networks · Challenge datasets


1 Introduction 


Image segmentation is one of the fundamental tasks of medical imaging, crucial
in modeling normal patient anatomy, detecting pathology, analysing a patient’s
health status and indicating medical treatments and procedures. For instance,
manual segmentation prior to surgical tumor removal and organ-at-risk
contouring for radiotherapy planning are time-consuming, mundane and thus
costly tasks carried out by expert annotators.


Developing automated algorithms can greatly reduce both the time and money
spent on medical image segmentation tasks. However, it is of extreme importance
to estimate the uncertainty of the output segmentation, as a poor segmentation
may adversely affect the treatments and procedures based on it. Despite
extensive expert training and experience, many studies have found that contours
drawn on a common set of images differ significantly between experts [5]. These
differences may naturally arise from the uncertainty of the segmentation task,
due to poor image contrast, lack of expert consensus, etc. We should therefore
expect the uncertainty of the annotations to be reflected in the predictions of
automated algorithms.

The Quantification of Uncertainties in Biomedical Image Quantification 
(QUBIQ) challenge [10] aims to develop and evaluate automatic algorithms for 
quantification of uncertainties, arising from experts’ (dis)agreement in biomedi- 
cal image segmentation. In 2020 the challenge presented four different MR and 
CT image datasets on which a total of seven segmentation tasks were released. 
In 2021, the organisers added two datasets each with a single task. 

This paper presents our approach to capturing multi-annotator segmenta- 
tion uncertainty for nine tasks of the QUBIQ 2020 and 2021 challenges. First 
the multi-annotator segmentations were aggregated, considering the same per- 
formance level for each of the expert annotators, such that they approximate the 
segmentation task uncertainty. We advanced the state-of-the-art nnU-Net
convolutional neural network (CNN) model by casting multi-annotator uncertainty
estimation as a multi-class segmentation problem, where the aggregated
segmentations were the prediction target. Thus the model was able to capture
and recreate the experts’ (dis)agreements. At the time of this writing, the
proposed approach achieved the third and sixth best scores on the respective
QUBIQ 2020 and 2021 leaderboards.


2 Related Work 


Supervised machine learning models like the deep CNNs for image segmenta- 
tion generally require large training datasets of annotated images to achieve 
adequate performance levels. In the medical imaging domain, however, we
typically obtain small datasets due to the high effort required to obtain
expert annotations (i.e. manual segmentations). When training models with a
single expert segmentation per image, we typically consider it as ground truth
(GT), despite
potential annotator bias and noise. A natural strategy to reduce the impact of 
annotator bias and noise is to consider the annotations of multiple experts. 

With the availability of multiple expert segmentations a common approach 
is to conceive a fusion strategy to approximate the GT [5]. The most
straightforward approaches are consensus voting, annotating the area as GT if
all annotators agree, and majority voting [4,9], assigning pixel labels
according to the majority rule. These definitions can be generalized by using
different agreement levels.
Lampert et al. [7] reported that increasing the level of agreement for forming GT 
increased the model’s reported performance. They further noted that a higher 
agreement level could result in over-optimistic results, as this could be the con- 
sequence of choosing the most obvious segments of the region of interest (ROI). 
Further, the problem with such an approach is the loss of information about 
inter-annotator variability. 
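
The voting rules described above can all be seen as thresholding the per-pixel vote count. The following is a minimal sketch of that idea, not code from the cited works; the function name `fuse_by_votes` and the parameter `min_votes` are illustrative assumptions.

```python
import numpy as np

def fuse_by_votes(masks, min_votes):
    """Fuse N binary annotator masks of shape (N, H, W) into one binary mask.

    A pixel becomes foreground when at least `min_votes` annotators marked it.
    min_votes = N reproduces consensus voting, min_votes = N // 2 + 1 majority
    voting; other values correspond to other agreement levels.
    """
    masks = np.asarray(masks, dtype=bool)
    votes = masks.sum(axis=0)          # number of annotators per pixel
    return votes >= min_votes

# Three hypothetical annotators on a 2 x 3 image.
masks = np.array([
    [[1, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 1, 1]],
    [[1, 1, 0], [0, 0, 1]],
])
n_annotators = masks.shape[0]
consensus = fuse_by_votes(masks, min_votes=n_annotators)
majority = fuse_by_votes(masks, min_votes=n_annotators // 2 + 1)
```

Note that both fused masks are binary, so the information on how many annotators agreed at each pixel is discarded, which is the loss of inter-annotator variability mentioned above.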
A more advanced and widely used approach to aggregating multiple expert
segmentations is the Simultaneous Truth and Performance Level Estimation 
(STAPLE) algorithm proposed by Warfield et al. [13]. The STAPLE algorithm 
uses expectation-maximization to compute a probabilistic estimate of the true 
segmentation and the sensitivity and specificity performance characteristics for 
each annotator. A similar approach to STAPLE is used in SIMPLE [8], which
additionally iteratively estimates the performance of segmentations and
discards poorly performing segmentations before finally fusing the remaining
segmentations to estimate the GT. Lampert et al. [7] showed that STAPLE
performs well
when inter-annotator variability is low, but degrades with the increasing num- 
ber of annotations and high variability of annotations. They also examined the 
effect of inter-annotator variance on foreground-background segmentation algo- 
rithms, in a computer vision setting. Despite not including deep neural networks, 
their results showed that the rank of the model is highly dependent on the
method chosen to form the GT. Furthermore, including a similar aggregation
strategy in the segmentation method will inevitably lead to overoptimistic
results.

Training machine learning models on datasets with multiple segmentation 
masks in a supervised manner allows for different representations and uses of 
the input data for model training. Firstly, each image-segmentation pair can be 
treated as a separate sample. For instance Hu et al. [1] propose a segmentation 
model based on the Probabilistic U-Net [6], where during model training multi- 
annotator segmentations of each image were fed to the network in the same 
mini-batch. Zhang et al. [14] took into account the multi-annotator dataset in 
the construction of the model architecture. The so called U-Net-and-a-half was 
constructed from a single encoder and multiple decoders. Each decoder corre- 
sponded to an expert allowing for simultaneous learning from all masks. The 
loss function was computed as the aggregated loss across all decoders. 

Many approaches model and/or quantify the segmentation output uncer- 
tainty. For instance, the model proposed by Hu et al. [1] based on the Proba- 
bilistic U-Net [6] uses inter-annotator variability as a training target. In this way, 
they were able to generate multiple diverse segmentations from each input, which 
represent a possible expert segmentation. Jungo et al. [4] computed uncertainty 
by the principle of Monte Carlo dropout. They used dropout layers at inference 
time to produce multiple segmentations and, by computing pixel-wise variance, 
estimated the model’s uncertainty. 
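
The Monte Carlo dropout principle summarized above can be illustrated with a small sketch. This is not the implementation of Jungo et al. [4]; `toy_stochastic_model` is a hypothetical stand-in for a CNN whose dropout layers remain active at inference, and all names and constants are assumptions chosen for illustration.

```python
import numpy as np

def toy_stochastic_model(image, rng, drop_rate=0.5):
    """Hypothetical stand-in for a CNN with dropout kept active at inference."""
    keep = rng.random(image.shape) >= drop_rate        # random dropout mask
    activation = image * keep / (1.0 - drop_rate)      # inverted-dropout scaling
    # Per-pixel foreground probability from a fixed logistic function.
    return 1.0 / (1.0 + np.exp(-4.0 * (activation - 0.5)))

def mc_dropout_uncertainty(model, image, n_samples=30, seed=0):
    """Run several stochastic forward passes and summarize them per pixel."""
    rng = np.random.default_rng(seed)
    samples = np.stack([model(image, rng) for _ in range(n_samples)])
    # The mean acts as the prediction, the pixel-wise variance as uncertainty.
    return samples.mean(axis=0), samples.var(axis=0)

image = np.array([[0.0, 0.0, 1.0, 1.0],
                  [0.0, 1.0, 1.0, 1.0]])
mean_prob, variance = mc_dropout_uncertainty(toy_stochastic_model, image)
```

In this toy setting, pixels with zero intensity are unaffected by dropout and receive (numerically) zero variance, while the remaining pixels vary between passes.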

To summarize, when designing architectures for modeling annotator uncer- 
tainty on datasets with multiple annotations, we need to formulate several com- 
putational strategies: (i) a strategy to deal with multiple annotations per image 
in the model training input, (ii) a strategy to approximate the ground truth, 
and finally (iii) a strategy to model uncertainty on the model output. 

In this paper, we focus on the first and third points, i.e. the strategy of 
handling multiple annotations per image and modeling of output uncertainty, 
while as for the second point we latently acknowledge that ground truth may not 
exist. Thus we propose to aggregate multiple annotations into a single mask and 
to treat each level of agreement as a separate class. Modeling multi-annotator 
uncertainty as a multi-class segmentation problem can be simply coupled with
any multi-class segmentation model. According to a recent review on noisy label
handling strategies [5] and our literature review, to the best of our knowledge, 
such a simple but effective solution to annotation aggregation and uncertainty 
modeling has not yet been proposed. 


3 Materials and Methods 


3.1 Datasets 


The QUBIQ 2020 challenge data consists of four 2D CT and MR datasets of 
different anatomies with seven segmentation tasks, where two of the datasets, 
namely the Prostate and Brain tumor datasets, include multiple ROIs. The QUBIQ
2021 challenge is an extension, including two additional 3D datasets, Pancreas 
and Pancreatic lesion, where each patient went through two scans at two time 
points. Each of the images was segmented by multiple trained experts, with 
annotator count ranging from 2 to 7, depending on the particular dataset. Addi- 
tional dataset information is given in Table 1 and a few examples are visualized 
in Fig. 1. 


Table 1. Number of given samples in training and validation dataset. 


Dataset            No. samples (Train/Val.)   No. structures   No. contours   No. modalities
Prostate           55 (48/7)                  2                6              1
Brain growth       39 (34/5)                  1                7              1
Brain tumor        32 (28/4)                  3                3              4
Kidney             24 (20/4)                  1                3              1
Pancreas           58 (40/18)                 1                2              1
Pancreatic lesion  32 (22/10)                 1                2              1


3.2 Multi-annotation Aggregation 


For segmentation of ROI given multiple annotations, we aggregated the given
binary segmentation masks into a single input mask $M^{in}$ as

$$M^{in}(x, y) = \sum_{i=1}^{N} B_i(x, y), \tag{1}$$

where $N$ denotes the number of experts and $B_i(x, y)$ the binary value of
pixel $(x, y)$ as annotated by the $i$-th expert. The values of the encoded mask
were thus between 0 and the number of experts, where each foreground pixel
value denotes the number of experts labeling the selected pixel as the ROI. In
this way we encode the three-dimensional mask input (no. of annotators, width,
height) and map it into a two-dimensional space,
$[N \times X \times Y] \to [X \times Y]$, for image width $X$ and image height
$Y$, as shown in Fig. 2. By encoding multiple image masks, we transformed the
problem into a multi-class classification problem with $N + 1$ classes
(including background), where class $c$ marks the agreement of exactly $c$
annotators, for $c \in \{0, 1, \ldots, N\}$.


Fig. 1. Exemplary images with multi-annotator masks for segmentation tasks with
single modalities (left to right): Prostate - Task 1, Prostate - Task 2, Brain
growth and Kidney. The color denotes the number of experts marking the area as
segmented organ, from 0 (blue) to all (red). (Color figure online)
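
The aggregation of Eq. (1) amounts to a per-pixel sum over the annotator axis. A minimal sketch follows, assuming the masks are stacked in an array of shape (N, X, Y); the function name `encode_annotations` is illustrative and not part of the original work.

```python
import numpy as np

def encode_annotations(binary_masks):
    """Sum N binary masks of shape (N, X, Y) into one mask with values 0..N."""
    binary_masks = np.asarray(binary_masks)
    if not np.isin(binary_masks, (0, 1)).all():
        raise ValueError("all annotator masks must be binary (0 or 1)")
    # Each output pixel counts how many annotators marked it as ROI.
    return binary_masks.astype(np.int64).sum(axis=0)

# Three hypothetical annotators on a 2 x 3 image.
annotations = np.array([
    [[0, 1, 1], [0, 1, 0]],
    [[0, 1, 1], [0, 0, 0]],
    [[0, 0, 1], [1, 0, 0]],
])
encoded = encode_annotations(annotations)
```

The resulting integer mask can be passed to any multi-class segmentation model as a label map with N + 1 classes.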


(a) (b) (c) (d) 


Fig. 2. Encoding of three binary segmentation masks (a), (b) and (c) into a single
encoded multi-annotator mask (d), where each pixel value equals the number of
experts marking the particular pixel as the ROI.




The CNN output is a three-dimensional matrix $[(N + 1) \times X \times Y]$,
with a vector $(p_0, p_1, \ldots, p_N)$ for pixel $(x, y)$, where $p_c$ marks
the probability that the pixel was marked as ROI by exactly $c$ annotators. By
computing $\arg\max_c p_c$ for each pixel, we get a two-dimensional output mask
$M^{out}$ with predicted regions of agreement between experts. We can further
decode the output mask into three-dimensional space, as shown in Fig. 3: we
obtain as many masks as there are annotators; however, this time each mask
represents a quantitative agreement value. Thus, the output mask $M_1^{out}$
represents the area where the structure of interest has been annotated by at
least one annotator, whereas the marked area decreases with increasing index
$c$ in

$$M_c^{out}(x, y) = \left(M^{out}(x, y) \geq c\right). \tag{2}$$
The output mask $M_1^{out}$ thus represents the pixels that would be marked by
at least one annotator, $M_2^{out}$ by at least two annotators, etc. Further,
dividing the output mask values by the number of annotators $N$ results in
values on the interval [0, 1], which can be interpreted as annotation or
segmentation (un)certainty and reflects the uncertainty of the expert
annotators.
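
The decoding described above can be sketched as follows, assuming the network output is available as an array of class probabilities of shape (N + 1, X, Y). The function name `decode_agreement` is an assumption made for illustration, and the comparison uses "at least c", as stated in the text.

```python
import numpy as np

def decode_agreement(class_probs):
    """Decode softmax output of shape (N + 1, X, Y).

    Returns the agreement map M_out (values 0..N), the N cumulative binary
    masks M_c_out = (M_out >= c) for c = 1..N, and the map M_out / N in [0, 1].
    """
    class_probs = np.asarray(class_probs, dtype=float)
    n_annotators = class_probs.shape[0] - 1
    m_out = class_probs.argmax(axis=0)                    # most likely agreement level
    levels = np.arange(1, n_annotators + 1).reshape(-1, 1, 1)
    cumulative = m_out[None] >= levels                    # cumulative[c - 1] = (M_out >= c)
    uncertainty = m_out / n_annotators
    return m_out, cumulative, uncertainty

# Hypothetical output for N = 3 annotators on a 1 x 4 image.
probs = np.array([
    [[0.7, 0.2, 0.1, 0.0]],   # class 0: no annotator
    [[0.2, 0.5, 0.2, 0.1]],   # class 1: exactly one annotator
    [[0.1, 0.2, 0.6, 0.2]],   # class 2: exactly two annotators
    [[0.0, 0.1, 0.1, 0.7]],   # class 3: all three annotators
])
m_out, cumulative, uncertainty = decode_agreement(probs)
```

By construction the cumulative masks are nested: every pixel in the mask for c + 1 is also contained in the mask for c.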


(a) (b) (c) (d) 


Fig. 3. Decoding of model output a) $M^{out}$ into three binary segmentation
masks b) $M_1^{out}$, c) $M_2^{out}$ and d) $M_3^{out}$, where $M_c^{out}$
denotes the predicted mask with ROI marked by at least $c$ experts.


3.3 Segmentation Model 


Structure segmentation and its uncertainty estimation were obtained by adapting
the open-source nnU-Net [2,3]. The nnU-Net framework implements a single or
cascaded U-Net model and, based on the input images, the particular model
and its hyperparameters are chosen and configured automatically. The following
subsections describe the framework and its adaptations.


Model Architecture. The nnU-Net (‘no-new-Net’) uses a 2D U-Net or 3D U-Net [11]
as a backbone architecture. The main advantage of nnU-Net is its
self-configuring training pipeline, with automatic adaptation of the model
architecture and hyperparameter tuning that considers the available hardware
resources and requires little or no user input. The encoder part starts with 32
feature maps in the initial layers and doubles the number of feature maps with
each downsampling, and vice versa in the decoder. The number of convolutional
blocks is adapted to the input patch size, assuring that downsampling does not
result in feature maps smaller than 4 × 4 (× 4). Compared to the original
U-Net, the nnU-Net authors replaced the ReLU activation functions with leaky
ReLU and batch normalization with instance normalization.


Loss Function. We applied the soft Dice loss function directly on the CNN
output probabilities. The output values were mapped to the [0, 1] interval
using the softmax activation function on the output layer. For each class
$c \in \{0, 1, \ldots, N\}$, where class $c = 0$ represents the background
without any annotations, we compute the soft Dice similarity coefficient

$$sDSC_c = \frac{2 \sum_{x,y} p_c(x, y) \cdot M_c^{in}(x, y)}{\sum_{x,y} p_c(x, y) + \sum_{x,y} M_c^{in}(x, y)}, \tag{3}$$

where $p_c(x, y)$ denotes the output probability of pixel $(x, y)$ belonging to
class $c$, and $M_c^{in}$ the binary input mask of class $c$. Finally, the Dice
coefficient was averaged over all $N + 1$ classes. For the loss function we
take the negative value

$$Loss = -\frac{1}{N + 1} \sum_{c=0}^{N} sDSC_c. \tag{4}$$
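
The soft Dice loss of Eqs. (3) and (4) can be sketched in NumPy as below. This is an illustrative reimplementation, not the loss code of the nnU-Net framework; the function name `soft_dice_loss` and the small constant `eps`, added to avoid division by zero for classes absent from both prediction and target, are assumptions.

```python
import numpy as np

def soft_dice_loss(probs, target, eps=1e-8):
    """Negative mean soft Dice over all N + 1 classes.

    probs:  softmax output of shape (N + 1, X, Y)
    target: encoded mask of shape (X, Y) with integer values 0..N
    """
    probs = np.asarray(probs, dtype=float)
    n_classes = probs.shape[0]
    # One binary mask per class, i.e. M_c_in for c = 0..N.
    class_masks = (target[None] == np.arange(n_classes).reshape(-1, 1, 1)).astype(float)
    intersection = (probs * class_masks).sum(axis=(1, 2))
    denominator = probs.sum(axis=(1, 2)) + class_masks.sum(axis=(1, 2))
    sdsc = 2.0 * intersection / (denominator + eps)       # Eq. (3), with eps assumed
    return -sdsc.mean()                                   # Eq. (4)

target = np.array([[0, 1], [2, 2]])                       # N = 2 annotators
perfect = (target[None] == np.arange(3).reshape(-1, 1, 1)).astype(float)
uniform = np.full((3, 2, 2), 1.0 / 3.0)
perfect_loss = soft_dice_loss(perfect, target)
uniform_loss = soft_dice_loss(uniform, target)
```

A prediction that matches the target exactly yields a loss close to -1, whereas an uninformative uniform prediction yields a loss closer to 0.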


Model Training. For each of the nine segmentation tasks on the six datasets we
trained a separate nnU-Net model that converged on average in 50 epochs. A 2D
model was trained for each of the 2D image segmentation tasks and a 3D model
with full resolution for the two 3D segmentation tasks. Note that 2D models
were also trained for the 3D data; however, the 2D models performed worse than
the 3D models. In the case of multi-modal data, i.e. brain lesions, a single
model was trained using all image modalities as the model input.

Based on the data fingerprint and a series of heuristic rules the image resam- 
pling and image normalization were determined. Further, the architecture of 
nnU-Net dynamically adapted to the dataset, selecting appropriate image input 
patch size and batch size [2]. To allow training on large image patches, the batch 
size was generally small, typically (but not less than) two images per batch. 

The nnU-Net model training included various data augmentation transformations,
each applied with a certain probability p, namely random rotations (p = 0.2),
scaling (p = 0.2), mirroring (p = 0.2), Gaussian noise (p = 0.1) and smoothing
(p = 0.2), and additive or multiplicative inhomogeneity simulation (p = 0.15 or
0.15, respectively). Models were trained using the stochastic gradient descent
optimizer with an initial learning rate of 0.01 and Nesterov momentum of 0.99.

The nnU-Net models were trained on patches that overlapped by half of the 
patch size. During inference the same patch size was used as during training. The 
predicted patches were then combined such that the contributions of different 
patch predictions across the common voxels were aggregated by weighing the 
predictions based on the voxel location. Since accuracy was expected to drop
towards the patch border, the contribution of such voxels was less than that of
voxels close to the patch center.
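
The patch aggregation described above can be sketched for the 2D case as follows. The Gaussian weight map, the value of `sigma_scale` and all function names are assumptions chosen for illustration; the text only states that voxels near the patch border contribute less than voxels near the center.

```python
import numpy as np

def center_weight(patch_size, sigma_scale=0.125):
    """Separable Gaussian weight map that peaks at the patch center (assumed form)."""
    weight = np.ones(patch_size)
    for axis, size in enumerate(patch_size):
        offset = np.arange(size) - (size - 1) / 2.0
        profile = np.exp(-0.5 * (offset / (sigma_scale * size)) ** 2)
        shape = [1] * len(patch_size)
        shape[axis] = size
        weight = weight * profile.reshape(shape)
    return weight

def window_starts(length, patch, step):
    """Start indices with the given step, plus a final window flush with the border."""
    starts = list(range(0, length - patch + 1, step))
    if starts[-1] != length - patch:
        starts.append(length - patch)
    return starts

def predict_with_overlap(image, predict_patch, patch_size):
    """Combine predictions of patches overlapping by half of the patch size."""
    ph, pw = patch_size
    weight = center_weight(patch_size)
    accumulated = np.zeros(image.shape, dtype=float)
    total_weight = np.zeros(image.shape, dtype=float)
    for y in window_starts(image.shape[0], ph, max(ph // 2, 1)):
        for x in window_starts(image.shape[1], pw, max(pw // 2, 1)):
            window = (slice(y, y + ph), slice(x, x + pw))
            accumulated[window] += weight * predict_patch(image[window])
            total_weight[window] += weight
    return accumulated / total_weight                     # weighted average per pixel

# Sanity check with an identity "model": the weighted average must return the input.
image = np.arange(120, dtype=float).reshape(10, 12)
combined = predict_with_overlap(image, lambda patch: patch, patch_size=(4, 4))
```

The sketch assumes that the image is at least as large as the patch in every dimension; padding of smaller images is not handled.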

Finally, the predictions were postprocessed by first checking whether, in the
training dataset samples, all classes lay within a single connected component.
If so, this property was also imposed on the test set by retaining the single
largest connected component for each class.
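
Retaining the largest connected component of a binary class mask can be sketched with a breadth-first search. This is a minimal 2D illustration assuming 4-connectivity; the connectivity used in the original postprocessing is not stated, and the function name `largest_component` is hypothetical.

```python
from collections import deque

import numpy as np

def largest_component(mask):
    """Keep only the largest 4-connected foreground component of a 2D binary mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    best = []
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        seen[start] = True
        queue, component = deque([start]), []
        while queue:
            y, x = queue.popleft()
            component.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                inside = 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                if inside and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        if len(component) > len(best):
            best = component
    result = np.zeros_like(mask)
    for y, x in best:
        result[y, x] = True
    return result

# Two components of sizes 3 and 1; only the larger one is kept.
prediction = np.array([[1, 1, 0, 0],
                       [1, 0, 0, 1],
                       [0, 0, 0, 0]])
cleaned = largest_component(prediction)
```

For a multi-class prediction the function would be applied to the binary mask of each class in turn.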


3.4 Evaluation Metrics 


Model performance was evaluated using the evaluation code provided by the QUBIQ
challenge organizers. We compared the predicted uncertainty mask $M^{out}/N$
with the uncertainty of the GT, computed as $M^{in}/N$. For each image, the
uncertainty masks were binarized at thresholds $0.1 \times i$,
$i = 0, 1, \ldots, 9$, for which the Dice coefficient $DSC_i$ was computed as

$$DSC_i = \frac{2 TP_i}{2 TP_i + FP_i + FN_i},$$

where TP denotes the true positive pixels, FP denotes the false positive pixels
and FN the false negative pixels. Finally, the scores were averaged across all
ten thresholds for the final performance estimation.
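
The thresholded Dice score described above can be sketched as follows. This is not the organizers' evaluation code: binarization with "greater than or equal", the convention that two empty masks score 1, and the function names are all assumptions.

```python
import numpy as np

def dice(pred, gt):
    """Dice coefficient of two binary masks; two empty masks score 1 (assumed)."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    denominator = 2 * tp + fp + fn
    return 1.0 if denominator == 0 else 2.0 * tp / denominator

def thresholded_dice_score(pred_uncertainty, gt_uncertainty):
    """Average Dice over binarizations at thresholds 0.0, 0.1, ..., 0.9."""
    thresholds = [i / 10 for i in range(10)]
    scores = [dice(pred_uncertainty >= t, gt_uncertainty >= t) for t in thresholds]
    return float(np.mean(scores))

# Hypothetical GT uncertainty M_in / N for N = 3 annotators on a 1 x 4 image.
gt_uncertainty = np.array([[0.0, 1 / 3, 2 / 3, 1.0]])
identical_score = thresholded_dice_score(gt_uncertainty, gt_uncertainty)
empty_score = thresholded_dice_score(np.zeros((1, 4)), gt_uncertainty)
```

Under these assumptions an all-zero prediction still matches the ground truth at threshold 0, where both masks cover the whole image, so it scores 0.1 in this example rather than 0.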


4 Results 


The results of our proposed model on six datasets and across nine segmentation
tasks were computed on the validation datasets and are reported in Table 2 and
Fig. 4. In four of the 2D segmentation tasks our approach achieved an average
Dice score over 0.9, while for the other three 2D tasks it achieved a score of over
0.7. The lowest scores, significantly below the average, were achieved for the two
3D segmentation tasks introduced in the QUBIQ 2021 challenge.


Table 2. Performance measure DSC per segmentation task evaluated by QUBIQ
challenge organizers. The average is computed over seven tasks for QUBIQ 2020
(disregarding pancreas and pancreatic lesion) and over nine tasks for QUBIQ
2021. (Note: Evaluation metrics on the QUBIQ 2020 and 2021 leaderboards are not
identical. The average score of our model reported on the QUBIQ 2020
leaderboard equals 0.7476.)


Structure               DSC
Brain growth            0.9336
Brain tumor - Task 1    0.9485
Brain tumor - Task 2    0.7808
Brain tumor - Task 3    0.7639
Kidney                  0.9766
Prostate - Task 1       0.9610
Prostate - Task 2       0.8280
Pancreas                0.5605
Pancreatic lesion       0.3990
Average - QUBIQ 2020    0.8846
Average - QUBIQ 2021    0.7946


120 M. Zukovec et al. 




Fig. 4. Scatter plot (left) with marked mean values (red) and boxplot (right) of
individual values of the average Dice coefficient DSC on validation images for
nine tasks: a) Brain growth, b) Brain tumor - Task 1, c) Brain tumor - Task 2,
d) Brain tumor - Task 3, e) Kidney, f) Prostate - Task 1, g) Prostate - Task 2,
h) Pancreas, i) Pancreatic lesion. (Color figure online)


Due to the nature of the metric DSC, the error of an incorrectly predicted 
pixel can accumulate when computing the DSC across multiple thresholds. In 
some validation images from the Brain tumor dataset (Tasks 2 and 3) the area of 
the annotated structures, where a fraction of the experts agree, measured only a 
few pixels. For such a small area an incorrect output value of even a single pixel 
changes the DSC metric value by a substantial amount. 

Structures a) Brain growth, b) Brain tumor - Task 1, e) Kidney and f) 
Prostate - Task 1 were predicted consistently, without significant variation in 
the DSC between different cases. Due to the consistent labelling of all experts 
and consistent size of the structures, the neural network predictions were also 
consistent. In the case of the listed structures, the region of agreement of all 
experts was much larger than the region of disagreement, compared to the other 
structures. In practice, this means that the misperceived agreement pattern of 
a subset of annotators does not contribute to the value of metric DSC to the 
extent that it does in the case of small structures. 

For the two 3D segmentation tasks, i.e. h) Pancreas and i) Pancreatic lesion, 
we observed a large variation of the DSC values. Specifically, in the cases with 
the value of DSC equal to 0.1, the model did not segment the ROI and instead
returned an empty mask. 


5 Discussion and Future Work 


Intra- and inter-annotator variations result in significantly differing manual 
segmentations, which may be related to the uncertainty of the segmentation 
task; hence, the algorithms for automatic segmentation should learn to cap- 
ture the annotator (dis)agreements. In our approach we modeled the annotator 
(dis)agreement by aggregating the multi-annotator segmentations to reflect the
uncertainty of the segmentation task and formulated the segmentation as a
multi-class pixel classification problem within the open-source nnU-Net
framework [3].

Validation was carried out for a wide range of imaging modalities and seg- 
mentation tasks as provided by the 2020 and 2021 QUBIQ challenges and showed 
high quality segmentations according to the average Dice scores. While inspect- 
ing our results we noticed a large variation in Dice scores across validation cases 
for the 2D Brain tumor segmentation tasks 2 and 3 and both segmentation tasks 
on the 3D datasets. In part, the low DSC scores in particular cases and high 
variability in the score in the aforementioned tasks can be attributed to the fact 
that the area of agreement covers only a few pixels. This is particularly evident
for Brain tumor segmentation - Task 2, as shown in Fig. 5, where one of the
raters consistently segments different ROIs than the other two raters. This
systematic difference is also captured by the model, which did not classify any
of the pixels as the area where all three annotators would agree.


MR Image Ground truth Prediction 


Fig. 5. Ground truth and prediction for Brain tumor - Task 2. Single annotator’s ROIs 
marking (blue) significantly differ from the segmentation of the other two (yellow), with 
a very small overlap of all three (red). (Color figure online) 


In 3D space, the ratio between the background and ROI becomes even larger.
The poor result can therefore again be partially attributed to class imbalance.
Further, we noticed a large difference in input image sizes along the z axis,
which vary from 36 to 194 pixels on the training set. To potentially improve
the results, cropping the image to the region around the ROI before training
the neural network could be considered.
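
The suggested reduction of the image size around the ROI could, for training data where the annotations are available, be sketched as a bounding-box crop. The function name `crop_to_roi` and the `margin` parameter are hypothetical, and how the ROI would be localized at test time is left open here.

```python
import numpy as np

def crop_to_roi(volume, roi_mask, margin=8):
    """Crop a volume to the bounding box of roi_mask, enlarged by `margin` voxels."""
    coords = np.nonzero(roi_mask)
    if coords[0].size == 0:
        # No ROI annotated: keep the full volume.
        slices = tuple(slice(0, size) for size in volume.shape)
        return volume, slices
    slices = tuple(
        slice(max(int(axis.min()) - margin, 0), min(int(axis.max()) + margin + 1, size))
        for axis, size in zip(coords, volume.shape)
    )
    return volume[slices], slices

# Hypothetical 10 x 10 x 10 volume with a small annotated region.
volume = np.random.default_rng(0).random((10, 10, 10))
roi = np.zeros(volume.shape, dtype=bool)
roi[4:6, 3:5, 2:9] = True
cropped, used_slices = crop_to_roi(volume, roi, margin=1)
```

The returned slices allow a prediction made on the cropped volume to be placed back into the original image grid.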

When forming the aggregated target segmentation, we assumed that all experts
were equally trained and thus took the sum of their segmentation masks as the
ground truth. However, in the case of major disagreements in annotations, such
as for Brain tumor segmentation - Task 2, a smaller weight could be given to
the annotator that is not in accordance with the others. The performance might
therefore be improved by the use of expert performance level estimates as
obtained from the SIMPLE algorithm [8], confusion matrices as in Tanno et al.
[12] or similar approaches for generating the GT used as target masks.

Finally, one of the main limitations of modeling multi-annotator
(dis)agreement as a multi-class problem is its sensitivity to minor changes in
the softmax output, which can result in pixel misclassification. Replacing the
argmax function with, for example, a weighted sum of classes using the softmax
outputs as weights could result in a more robust model.
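
The weighted sum proposed above as an alternative to argmax can be sketched as the expected agreement level under the softmax distribution. This is only an illustration of the idea, which was not evaluated in this work; the function name `expected_agreement` is an assumption.

```python
import numpy as np

def expected_agreement(class_probs):
    """Expected fraction of agreeing annotators per pixel.

    class_probs: softmax output of shape (N + 1, X, Y); class c means that
    exactly c annotators marked the pixel. Returns values in [0, 1].
    """
    class_probs = np.asarray(class_probs, dtype=float)
    n_annotators = class_probs.shape[0] - 1
    levels = np.arange(n_annotators + 1).reshape(-1, 1, 1)
    return (class_probs * levels).sum(axis=0) / n_annotators

# Two hypothetical pixels for N = 3: an ambiguous one and a confident one.
probs = np.array([
    [[0.4, 0.0]],
    [[0.3, 0.0]],
    [[0.3, 0.1]],
    [[0.0, 0.9]],
])
soft = expected_agreement(probs)
hard = probs.argmax(axis=0) / 3
```

For the ambiguous pixel the argmax decision returns 0, although 60% of the probability mass lies on classes with at least one annotator, whereas the weighted sum returns an intermediate value.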


6 Conclusion 


The goal of the QUBIQ challenge was to segment nine different structures of 
interest, i.e. organs and pathologies, in six different datasets, for which segmen- 
tation masks of multiple experts were provided. In the context of the established 
nnU-Net segmentation framework, we proposed a novel strategy of handling 
multiple annotations per image and modeling of output uncertainty. Namely, 
we aggregate multiple annotations into a single mask and treat each level
of agreement as a separate class, thus modeling multi-annotator segmentation
uncertainty as a multi-class segmentation problem. We achieved high quality
segmentation results, with the third and sixth best overall Dice scores on the
respective QUBIQ 2020 and 2021 challenge leaderboards.
Abstract. Glioblastoma is profoundly heterogeneous in regional
microstructure and vasculature. Characterizing the spatial heterogene- 
ity of glioblastoma could lead to more precise treatment. With unsuper- 
vised learning techniques, glioblastoma MRI-derived radiomic features 
have been widely utilized for tumor sub-region segmentation and sur- 
vival prediction. However, the reliability of algorithm outcomes is often 
challenged by both ambiguous intermediate process and instability intro- 
duced by the randomness of clustering algorithms, especially for data 
from heterogeneous patients. 

In this paper, we propose an adaptive unsupervised learning approach 
for efficient MRI intra-tumor partitioning and glioblastoma survival pre- 
diction. A novel and problem-specific Feature-enhanced Auto-Encoder 
(FAE) is developed to enhance the representation of pairwise clinical 
modalities and therefore improve clustering stability of unsupervised 
learning algorithms such as K-means. Moreover, the entire process is
modelled by the Bayesian optimization (BO) technique with a custom
loss function, such that the hyper-parameters can be adaptively optimized
in reasonably few steps. The results demonstrate that the proposed
approach can produce robust and clinically relevant MRI sub-regions and
statistically significant survival predictions. 
clustering · Bayesian optimization · Survival prediction


C. Li—Equal contribution. 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 
A. Crimi and S. Bakas (Eds.): BrainLes 2021, LNCS 12962, pp. 124-139, 2022. 
https://doi.org/10.1007/978-3-031-08999-2_10


Adaptive Learning with Enhanced Representation for Glioblastoma 125 


1 Introduction 


Glioblastoma is one of the most aggressive adult brain tumors characterized 
by heterogeneous tissue microstructure and vasculature. Previous research has 
shown that multiple sub-regions (also known as tumor habitats) co-exist within 
the tumor, which gives rise to the disparities in tumor composition among 
patients and may lead to different patient treatment response [9,10]. Regional 
differences within the tumor are often seen on imaging and may have a
prognostic significance [30]. The intra-tumor heterogeneity is near ubiquitous
in malignant tumors and likely reflects cancer evolutionary dynamics [12,25].
Therefore, this intra-tumoral heterogeneity has significantly challenged the precise
treatment of patients. Clinicians desire a more accurate identification of intra- 
tumoral invasive sub-regions for targeted therapy. 

Magnetic resonance imaging (MRI) is a non-invasive technique for tumor 
diagnosis and monitoring. MRI radiomic features [22] provide quantitative infor- 
mation for both tumor partition and survival prediction [7,8]. Mounting evi- 
dence supports the usefulness of the radiomic approach in tumor characteriza- 
tion, evidenced by the Brain Tumor Image Segmentation (BraTS) challenge, 
which provides a large dataset of structural MRI sequences, i.e., T1-weighted,
T2-weighted, post-contrast T1-weighted (T1C), and fluid attenuation inversion
recovery (FLAIR). Although providing high tissue contrast, these weighted MRI
sequences are limited by their non-specificity in reflecting tumor biology, which
physiological MRIs, e.g., perfusion MRI (pMRI) and diffusion MRI (dMRI),
could complement. Specifically, pMRI measures vascularity within the tumor,
while dMRI estimates the brain tissue microstructure. Incorporating these com- 
plementary multi-modal MRI has emerged as a promising approach for more 
accurate tumor characterization and sub-region segmentation for clinical deci- 
sion support. 

Unsupervised learning methods have been widely leveraged to identify the
intra-tumoral sub-regions based on multi-modal MRI [4,17,19,26,29,31].
Standard unsupervised learning methods, e.g., K-means, require a pre-defined
class number, which lacks concrete determination criteria and thus affects the
robustness of sub-region identification. For instance, some researchers used
pre-defined class numbers according to empirical experience before clustering
[4,17]. Other work [14,31] introduced clustering metrics, e.g., the
Calinski-Harabasz (CH) index, which quantifies the quality of clustering
outcomes, to estimate the ideal class number. However, the CH index is
sensitive to data scale [14,31], limiting its generalization ability across
datasets. Some other clustering techniques, e.g., agglomerative clustering, do
not require a pre-defined class number and instead require manual
classification. A sensitivity hyper-parameter, however, is often needed a
priori, and the clustering results can be unstable during iterations and
across datasets. Due to the above limitations, the generalization ability of
clustering methods has been a significant challenge in clinical applications,
particularly when dealing with heterogeneous clinical data.

Further, the relevance of clustering results is often assessed using patient
survival in clinical studies [2,6,11,17]. However, existing research has seldom
addressed the potential influence of instability posed by the unsupervised
clustering algorithms. Joint hyper-parameter optimization considering both
clustering stability and survival relevance is desirable in tumor sub-region
partitioning.

In this paper, we propose a variant of the auto-encoder (AE), termed
Feature-enhanced Auto-Encoder (FAE), to identify a robust latent feature space
constituted by the multiple input MRI modalities and thus alleviate the impact
of heterogeneous clinical data. Additionally, we present a Bayesian
optimization (BO) framework [24] to undertake the joint optimization task in
conjunction with a tailored loss function, which ensures clinical relevance
while boosting clustering stability. As a non-parametric optimization technique
based on Bayes' Theorem and Gaussian Processes (GP) [21], BO learns a
representation of the underlying data distribution so that the most probable
candidate of the hyper-parameters is generated for evaluation in each step.
Here, BO is leveraged to identify the (sub)optimal hyper-parameter set with the
potential to effectively identify robust and clinically relevant tumor
sub-regions. The primary contributions of this work include:


— Developing a novel loss function that balances the stability of sub-region 
segmentation and the performance of survival prediction. 

— Developing an FAE architecture in the context of glioblastoma studies to 
further enhance individual clinical relevance between input clinical features 
and improve the robustness of clustering algorithms. 

— Integrating a BO framework that enables automatic hyper-parameter search, 
which significantly reduces the computational cost and provides robust and 
clinically relevant results. 


The remainder of this paper is organized as follows. Section 2 describes the
overall study design, the proposed framework, and techniques. Section 3 reports
numerical results, and Sect. 4 gives concluding remarks.


2 Problem Formulation and Methodology 


Consider an N-patient multi-modal MRI dataset $\Omega$ with M modalities defined
as $\{X_m\}_{m=1}^{M}$. $X_m$ denotes the mth (pixel-wise) modality values over a
collection of N patients: $X_m = \{x_{m,n}\}_{n=1}^{N}$, where
$x_{m,n} \in \mathbb{R}^{I_{m,n} \times 1}$ and $I_{m,n}$ denotes the total
pixel number of an individual MRI image for the mth modality of the nth patient.

Our goal is to conduct sub-region segmentation on MRI images and perform
clinically explainable survival analysis. Instead of running unsupervised learning
algorithms directly on $X_m$, we introduce an extra latent feature enhancement
scheme (termed FAE) prior to the unsupervised learning step to further improve
the efficiency and robustness of clustering algorithms.

As shown in Fig. 1(A), FAE aims to produce a set of latent features
$\{Z_{m'}\}_{m'=1}^{M}$ that represent the original data $\{X_m\}_{m=1}^{M}$. Unlike a
standard AE that takes all modalities as input, FAE 'highlights' pairwise common
features and produces Z through a set of encoders (denoted as E) and decoders
(denoted as D). The latent features are then used in unsupervised clustering to
classify tumor sub-regions $\{P_n\}_{n=1}^{N}$ for all patients. As an intermediate
step, we can now produce spatial features $\{F_n\}_{n=1}^{N}$ from the segmented
figures through radiomic spatial feature extraction methods such as the gray
level co-occurrence matrix (GLCM) and gray level run length matrix (GLRLM) [15].

Fig. 1. A: Workflow of the proposed approach. The entire process is modelled under
a Bayesian optimization framework. B: Architecture of FAE. The light orange circle
represents modality $X_m$ over all patients and the blue circle is the latent
feature $Z_{m'}$. The green dotted frame denotes the modality pair, and the green
trapezoid represents the feature-enhanced encoder E and decoder D. The blue
trapezoid indicates the fully connected decoder $D_s$. C: Illustration of stability
loss calculation. Circles in different colours represent individual patient MRI
data, which are randomly shuffled K times and split into train/validation sets.
(Color figure online)

2.1 Feature-Enhanced Auto-Encoder


FAE is developed on the auto-encoder (AE), a type of artificial neural network
used for dimensionality reduction. A standard AE is a 3-layer symmetric network
that has the same inputs and outputs. As illustrated in Fig. 1(B), FAE contains
W feature-enhanced encoder layers $\{E_w\}_{w=1}^{W}$ to deal with the
$\{G_w\}_{w=1}^{W}$ pairs of modalities, where $W = \binom{M}{2}$ is the number of
pairs (combinations) given M inputs. The wth encoder takes a pair of modalities
from $\{X_m\}_{m=1}^{M}$ and encodes it to a representation $e_w$. The central hidden
layer of FAE contains nodes $\{Z_{m'}\}_{m'=1}^{M}$ that represent M learnt abstract
features. FAE also possesses a 'mirrored' architecture similar to AE, where W
feature-enhanced decoder layers $\{D_w\}_{w=1}^{W}$ are connected to the decoded
representations $\{d_w\}_{w=1}^{W}$.
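
As a small illustration of the pair construction described above, the sketch below enumerates the modality pairs $G_w$ for the three modalities used later in the experiments (p, q and r). The variable names are illustrative only and are not taken from the authors' implementation.

```python
from itertools import combinations
from math import comb

# Modality names follow Sect. 2.5: p and q from dMRI, r (rCBV) from pMRI.
modalities = ["p", "q", "r"]
M = len(modalities)

# Each unordered pair of distinct modalities corresponds to one pair G_w,
# which would feed one feature-enhanced encoder E_w.
pairs = list(combinations(modalities, 2))
W = len(pairs)

assert W == comb(M, 2)  # W = M(M-1)/2
print(pairs)  # [('p', 'q'), ('p', 'r'), ('q', 'r')]
```

With M = 3 the number of pairs W also equals 3; for larger M, W grows quadratically, which is worth keeping in mind when adding modalities.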

Unlike the standard symmetric AE, FAE has a 'dual decoding' architecture
in which an extra fully-connected decoder layer $D_s$ is added to the decoding
half of the network to connect $\{d_w\}_{w=1}^{W}$ directly to the outputs
$\{X'_m\}_{m=1}^{M}$. Decoder $D_s$ aims to pass all output information (and
correlations), rather than only the pairwise information from $G_w$, in the
back-propagation process. As a result, the node weights $\{Z_{m'}\}_{m'=1}^{M}$ are
updated by gradients from both $\{D_w\}_{w=1}^{W}$ and $D_s$. In practice, Z and the
encoders are amended in turn by $\{D_w\}_{w=1}^{W}$ (i.e., reconstruction loss
from pairwise AEs) and $D_s$ (i.e., global reconstruction loss).

FAE enhances the latent features in every pair of input modalities before
reducing the dimensionality from W to M. For instance, $e_w$ is a unique
representation that only depends on (and thus enhances the information of) the
given input pair $G_w$. Under this dual decoding architecture, FAE takes
advantage of highlighting the pairwise information in $\{Z_{m'}\}_{m'=1}^{M}$ while
retaining the global correlation information from $D_s$. Another advantage of
FAE lies in its flexibility with respect to the dimensionality of input
features. The FAE presented in this paper always produces the same number of
latent features as the input dimension. The latent dimension might be further
reduced manually depending on computational/clinical needs.


2.2 Patient-Wise Feature Extraction and Survival Analysis 


We implement Kaplan-Meier (KM) survival analysis [2,17] on spatial features
and sub-region counts $\{F_n\}_{n=1}^{N}$ to verify the relevance of clustering
sub-regions. To characterize the intratumoral co-existing sub-regions, we
employed the commonly used texture features from the GLCM and GLRLM families,
i.e., Long Run Emphasis (LRE), Relative mutual information (RMI), Joint Energy,
Run Variance (RV) and Non-Uniformity. These features are formulated to reflect
the spatial heterogeneity of tumor sub-regions. For example, LRE indicates the
prevalence of a large population of tumor sub-regions. The formulas and
interpretations of all these features are detailed in [27]. We next use the
k-medoids technique to classify N patients into high- and low-risk subgroups
based on $\{F_n\}_{n=1}^{N}$ and then perform KM analysis to analyze the survival
significance of the subgroups to determine $L_p$, as described in Sect. 2.3 and
Eq. 2.
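
For readers less familiar with KM analysis, the following minimal sketch computes the Kaplan-Meier survival estimate for one patient subgroup. It is a textbook product-limit estimator written with the standard library only; the function name and the toy data are hypothetical and do not come from the study.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct event time.

    durations: follow-up time per patient
    events:    1 if death was observed at that time, 0 if censored
    """
    curve = []
    survival = 1.0
    event_times = sorted({t for t, e in zip(durations, events) if e})
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Observed deaths exactly at time t (censored patients excluded).
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical): four patients, one of them censored at t = 2.
toy = kaplan_meier([1, 2, 2, 3], [1, 0, 1, 1])
```

In the workflow described here, one such curve would be estimated for each of the high- and low-risk subgroups, and the log-rank test comparing the two curves supplies the p-value used in $L_p$.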
2.3 Constructing Problem-Specific Losses


Stability Loss. We first introduce a stability quantification scheme to evaluate
clustering stability using pairwise cluster distance [13,28], which will serve as
part of the loss function in hyper-parameter optimization. Specifically, we employ
a Hamming distance method (see [28] for details) to quantify the gap between
clustering models. We first split the MRI training dataset $\Omega$ into train and
validation sets, denoted as $\Omega_{train}$ and $\Omega_{val}$ respectively. We then
train two clustering models C (based on $\Omega_{train}$) and C' (based on
$\Omega_{val}$). The stability loss aims to measure the performance of model C on
the unseen validation set $\Omega_{val}$. The distance d(·) (also termed $L_s$) is
defined as:

$$L_s = d(C, C') = \min_{\pi} \frac{1}{I_{val}} \sum_{\Omega_{val}} \mathbb{1}_{\{\pi(C(\Omega_{val})) \neq C'(\Omega_{val})\}} \qquad (1)$$

where $I_{val}$ denotes the total number of pixels over all MRI images in the
validation set $\Omega_{val}$, $\mathbb{1}$ represents the Dirac delta function [32]
that returns 1 when the inequality condition is satisfied and 0 otherwise, and
function $\pi(\cdot)$ denotes the repeated permutations of dataset $\Omega$ to
guarantee the generalization of the stability measure [28].

Figure 1(C) shows the diagram for $L_s$ calculation, where N patients are
randomly shuffled K times to mitigate the effect of randomness. K pairs of
intermediate latent features $\{Z_{train,k}, Z_{val,k}\}_{k=1}^{K}$ are generated
through FAE for training the clustering models C and C'. We then compute $L_s$
over K repeated trials. $L_s$ is normalized to the range [0,1], and smaller
values indicate more stable clusterings.
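
The stability computation can be sketched as follows. This is a minimal illustration that assumes the minimisation in Eq. (1) runs over relabelings of the cluster indices, as is usual for Hamming-type clustering distances; the exact definition used by the authors follows [28]. The function names and toy labels are hypothetical.

```python
from itertools import permutations


def stability_loss(labels_c, labels_c_prime, n_clusters):
    """Minimal fraction of validation pixels on which two clusterings disagree.

    labels_c:       labels assigned to the validation pixels by model C
    labels_c_prime: labels assigned to the same pixels by model C'
    Cluster indices are arbitrary, so the mismatch rate is minimised over all
    relabelings of C. Brute force is only feasible for few clusters.
    """
    n = len(labels_c)
    best = 1.0
    for perm in permutations(range(n_clusters)):
        mismatches = sum(
            1 for a, b in zip(labels_c, labels_c_prime) if perm[a] != b
        )
        best = min(best, mismatches / n)
    return best


def mean_stability(trials, n_clusters):
    """Average the per-trial losses over K shuffled train/validation splits."""
    return sum(stability_loss(a, b, n_clusters) for a, b in trials) / len(trials)


# Hypothetical labels: C' equals C with clusters 0 and 1 swapped,
# plus a single genuine disagreement on the last pixel.
c = [0, 0, 1, 1, 2, 2]
c_prime = [1, 1, 0, 0, 2, 0]
loss = stability_loss(c, c_prime, 3)  # one mismatch out of six pixels
```

The result lies in [0,1] by construction, which matches the normalised range stated above. For millions of pixels and up to seven clusters, an assignment solver on the confusion matrix would be a more practical way to find the best relabeling than enumerating permutations.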


Significance Loss. We integrate prior knowledge from clinical survival analysis
and develop a significance loss $L_p$ to quantify clinical relevance between the
clustering outcomes and patient survival, as shown in the equation below:

$$L_p = \log\left(\frac{\tau}{p}\right) \qquad (2)$$

where p represents the p-value (i.e., statistical significance measure) of the
log-rank test in the survival analysis and $\tau$ is a predefined threshold.

This follows the clinical practice that a lower p-value implies that the
segmented tumor sub-regions can provide sensible differentiation for patient
survival. In particular, given threshold $\tau$, for p less than the threshold, the
loss equation returns an increasing positive reward. Otherwise, for p greater
than or equal to $\tau$, the segmented tumor sub-regions are considered undesirable
and the penalty increases with p.


2.4 Bayesian Optimization 


Hyper-parameter tuning is computationally expensive and often requires expert
knowledge, both of which raise practical difficulties in clinical applications. In
this paper, we consider two undetermined hyper-parameters: a quantile threshold
$\gamma \in [0,1]$ that distinguishes outlier data points from the majority and the
cluster number $\eta$ for the pixel-wise clustering algorithm. We treat the entire
process of Fig. 1(A) as a black-box system, of which the input is the
hyper-parameter set $\theta = [\gamma, \eta]$ and the output is a joint loss
$\mathcal{L}$ defined as:

$$\mathcal{L} = \alpha L_s + (1 - \alpha) L_p \qquad (3)$$

where $\alpha$ is a coefficient in [0,1] that balances $L_s$ and $L_p$.
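
The two loss terms can be evaluated with a few lines of code. The sketch below assumes that Eq. (2) reads $\log(\tau/p)$ with a natural logarithm and uses $\tau = 0.05$ purely as an illustrative default; neither the log base nor the threshold value is stated in the text. It only evaluates the formulas and does not reproduce the normalisation applied to the losses in the experiments.

```python
import math


def significance_loss(p_value, tau=0.05):
    """Eq. (2), assumed form: log(tau / p). tau = 0.05 is an assumed default."""
    return math.log(tau / p_value)


def joint_loss(l_s, l_p, alpha):
    """Eq. (3): convex combination of stability and significance terms."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * l_s + (1.0 - alpha) * l_p


# The two limiting cases discussed for the experiments:
only_stability = joint_loss(0.2, 1.5, alpha=1.0)     # equals the stability term
only_significance = joint_loss(0.2, 1.5, alpha=0.0)  # equals the significance term
```

At $p = \tau$ the significance term is exactly zero, which marks the boundary between the reward and penalty regimes described above.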


Algorithm 1: Bayesian optimization for hyper-parameter tuning

Initialization of GP surrogate f and the RBF kernel K(·)
while not converged do
    Fit GP surrogate model f with $\{\theta_j, \mathcal{L}_j\}_{j=1}^{J}$
    Propose a most probable candidate $\theta_{J+1}$ through Equation (4)
    Run Algorithm 2 with $\theta_{J+1}$, and compute loss $\mathcal{L}_{J+1}$
    Estimate current optimal $\theta_{J+2}$ of the constructed GP surrogate f'
    Run Algorithm 2 with $\theta_{J+2}$, calculate the loss $\mathcal{L}_{J+2}$
    J = J + 2
end
Obtain (sub)optimal $\theta_*$ upon convergence


We address the hyper-parameter tuning issue by modelling the black-box
system under BO, a sequential optimization technique that aims to approximate
the search space contour of $\theta$ by constructing a Gaussian Process (GP)
surrogate function in light of data. BO adopts an exploration-exploitation scheme
to search for the most probable $\theta$ candidate and therefore minimize the
surrogate function mapping $f : \Theta \to \mathcal{L}$ in J optimization steps,
where $\Theta$ and $\mathcal{L}$ denote input and output spaces respectively. The GP
surrogate is defined as $f \sim \mathcal{GP}(\cdot \mid \mu, \Sigma)$, where $\mu$ is
the J × 1 mean function vector and $\Sigma$ is a J × J covariance matrix composed by
the pre-defined kernel function K(·) over the inputs $\{\theta_j\}_{j=1}^{J}$. In
this paper, we adopt a standard radial basis function (RBF) kernel (see [3] for
an overview of GP and the kernel functions).
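
To make the covariance construction concrete, the sketch below builds a J × J matrix from an RBF kernel over a few hyper-parameter settings. The length scale of 1.0 and the example settings are assumptions for illustration; the paper does not report the kernel parameters.

```python
import math


def rbf_kernel(theta_a, theta_b, length_scale=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 * length_scale^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(theta_a, theta_b))
    return math.exp(-sq_dist / (2.0 * length_scale ** 2))


def covariance_matrix(thetas, length_scale=1.0):
    """J x J covariance over the evaluated hyper-parameter settings."""
    return [[rbf_kernel(a, b, length_scale) for b in thetas] for a in thetas]


# Hypothetical evaluated settings theta = [gamma, eta].
thetas = [[0.1, 3], [0.5, 5], [0.9, 7]]
sigma = covariance_matrix(thetas)
```

Because the two hyper-parameters live on very different scales (a quantile in [0,1] and an integer cluster count), inputs are commonly rescaled, or given separate length scales, before such a kernel is applied.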

Given training data $\Omega_\theta = \{\theta_j, \mathcal{L}_j\}_{j=1}^{J}$, BO
introduces a so-called acquisition function a(·) to propose the most probable
candidate to be evaluated at each step. Amongst various types of acquisition
functions [24], we employ an expected improvement (EI) strategy that seeks new
candidates to maximize the expected improvement over the current best sample.
Specifically, suppose f' returns the best value so far; EI searches for a new
$\theta$ candidate that maximizes the function $g(\theta) = \max\{0, f' - f(\theta)\}$.
The EI acquisition can thus be written as a function of $\theta$:

$$a_{EI}(\theta) = \mathbb{E}(g(\theta) \mid \Omega_\theta) = (f' - \mu)\,\Phi(f' \mid \mu, \Sigma) + \Sigma\,\mathcal{N}(f' \mid \mu, \Sigma) \qquad (4)$$

where Φ(·) denotes the CDF of the standard normal distribution. In practice, the
BO step J increases over time and the optimal $\theta_*$ can be obtained once the
predefined convergence criterion is satisfied. Pseudo-code of the entire process
is shown in Algorithms 1 and 2.
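
The EI acquisition has a well-known closed form when the surrogate prediction at a candidate is Gaussian. The sketch below uses that standard form for a minimisation problem, written with a scalar posterior mean `mu` and standard deviation `sigma` at the candidate; this scalar notation is an assumption made for illustration and differs slightly from the notation of Eq. (4).

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu, sigma, best):
    """E[max(0, best - f)] for f ~ N(mu, sigma^2), i.e. EI for minimisation."""
    if sigma <= 0.0:
        # No predictive uncertainty: the improvement is deterministic.
        return max(0.0, best - mu)
    z = (best - mu) / sigma
    return (best - mu) * normal_cdf(z) + sigma * normal_pdf(z)
```

A candidate whose predicted loss lies well below the current best (for example `expected_improvement(0.2, 0.1, 0.5)`) scores higher than one predicted above it with the same uncertainty (`expected_improvement(0.6, 0.1, 0.5)`), while larger `sigma` raises the score of otherwise unpromising candidates. This is the exploration-exploitation balance mentioned above.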
2.5 Experiment Details


Data from a total of N = 117 glioblastoma patients were collected and divided
into a training set $\Omega$ = 82 and a test set $\Omega_{test}$ = 35, where the test
set was held out for out-of-sample model evaluation. We collected both pMRI and
dMRI data and co-registered them to T1C images (details in Appendix 5.1),
containing approximately 11 million pixels per modality over all patients. M = 3
input modalities were calculated, including rCBV (denoted as r) from pMRI, and
isotropic/anisotropic components (denoted as p/q) of dMRI, thus X = {p, q, r}.
Dataset $\Omega$ was used for stability loss calculation with $\Omega_{train}$ = 57
and $\Omega_{val}$ = 25. $L_s$ was evaluated over K = 10 trials for all following
experiments. The BO is initialized with J = 10 data points $\Omega_\theta$, where
$\gamma \in [0,1]$ and $\eta$ is an integer between 3 and 7. The models were
developed on the PyTorch platform [18] under Python 3.8. Both encoder E and
decoder D employed a fully connected feed-forward NN with one hidden layer, where
the hidden node number was set to 10. We adopted the hyperbolic tangent as the
activation function for all layers, mean squared error (MSE) as the loss
function, and Adam as the optimiser.


Algorithm 2: Pseudo-code of the workflow as a component of BO

// Initialization
Prepare MRI data $\Omega$ with N patients and M modalities, perform data filtering
    with quantile threshold $\gamma$
// FAE training follows Figure 1(B)
Compose W pairs of modalities $\{G_w\}_{w=1}^{W}$, where $W = \binom{M}{2}$
Train FAE on $\{X_m\}_{m=1}^{M}$ to generate latent features $\{Z_{m'}\}_{m'=1}^{M}$
// Stability loss calculation follows Figure 1(C)
for k = 1, 2, ..., K do
    Randomly divide $\Omega$ into train ($\Omega_{train}$) and validation ($\Omega_{val}$) sets
    Produce latent pairs $\{Z_{train,k}, Z_{val,k}\}$
    // Pixel-wise clustering
    Obtain $C_k$ and $C'_k$ through standard K-means with $\eta$ clusters
    Compute kth stability loss $L_{s,k}$ by Eq. (1)
end
Compute stability score $L_s$ by averaging over $\{L_{s,k}\}_{k=1}^{K}$
// Sub-region segmentation
Obtain patient-wise sub-region segments $\{P_n\}_{n=1}^{N}$
// Patient-wise feature extraction
Extract $\{F_n\}_{n=1}^{N}$ for all N patients
// Survival analysis
Cluster patients into high/low risk subgroups based on $\{F_n\}_{n=1}^{N}$ using a
    standard K-Medoids algorithm. Perform survival analysis and obtain p
// BO loss calculation
Compute clinical significance score $L_p$ by Eq. (2)
Compute joint loss $\mathcal{L}$ following Eq. (3)

3 Results and Discussions


We first present the clustering stability of models incorporating the FAE
architecture and other AE variants against the baseline model, and then compare
the performance of the proposed methodology under different experimental
settings. The FAE contains 1 hidden layer with 10 hidden nodes; this
hyper-parameter choice, which keeps the architecture simple to compare in
numerical experiments, was determined empirically. We finally present the
results of the survival analysis and the independent test.


3.1 Evaluation of FAE Based Clustering 


The results comparing the models are detailed in Table 1. All three
AE variants show better stability performance than the baseline model
across the varying cluster numbers. Of note, our proposed FAE architecture, which
incorporates both standard AE and ensemble AE, outperforms the other models in
the majority of comparisons.


Table 1. Stability performance of cluster algorithms under different AE variants. Base- 
line represents the original model without AE. The standard AE represents a standard 
3-layer (with 1 hidden layer) feed-forward network and the ensemble AE is the FAE 
without dual decoder D,. The hidden layer contains 10 nodes for all AE variants. 


Clusters          | 3            | 4            | 5            | 6

Stability score
Baseline          | 0.761±0.026  | 0.890±0.04   | 0.744±0.027  | 0.761±0.035
Standard AE       | 0.909±0.024  | 0.896±0.063  | 0.859±0.06   | 0.836±0.061
Ensemble AE       | 0.972±0.013  | 0.921±0.028  | 0.872±0.046  | 0.881±0.046
FAE               | 0.909±0.048  | 0.923±0.029  | 0.911±0.038  | 0.891±0.048

Calinski-Harabasz (CH) score
Baseline (10°)    | 4.12±0.00003 | 5.16±0.00013 | 4.82±0.00003 | 4.73±0.00009
Standard AE (10°) | 5.94±0.63    | 5.74±0.51    | 5.50±0.41    | 5.36±0.28
Ensemble AE (10°) | 10.43±0.67   | 10.99±0.52   | 10.98±0.89   | 11.09±1.00
FAE (10°)         | 13.85±4.45   | 14.85±4.49   | 15.09±4.19   | 15.34±4.14


As expected, all AE variants enhance the clustering stability and quality,
as shown by the stability score and CH score. The latter is relatively sensitive
to data scale but can provide a reasonable evaluation for a fixed dataset. In our
case, as the dimensions of the original input modalities and the latent features
remain identical (M = 3), the considerably improved stability of the models
incorporating the FAE architecture suggests the usefulness of the FAE in
extracting robust features for unsupervised clustering. Additionally, our
experiments show that the FAE demonstrates remarkably stable clustering
performance when the training data is randomly selected, which further supports
the resilience of the FAE in extracting generalizable features for
distance-based clustering algorithms.


3.2 Adaptive Hyper-parameter Tuning 


Figure 2 shows the performance of the proposed approach for 4 different $\alpha$
values in terms of stability score (a lower score indicates better stability).
10 initial training steps and 20 follow-up BO steps are evaluated in the
experiments, and all results are averaged over 10 repeated trials. One sees
significant dispersion of initial points (dots in the left half of each figure)
in all figures, indicating reasonable randomness of initial points in BO
training. BO proposes a new candidate $\theta$ per step after the initial training.
One observes that the joint loss $\mathcal{L}$ (orange curves) converges and the
proposed approach successfully estimates the (sub)optimal $\theta_*$ in all
$\alpha$ cases.

Fig. 2. Performance of the proposed approach with respect to BO step number (on
the x-axis). Each figure contains two y-axes: stability loss $L_s$ (in blue) on the
left y-axis, and both significance loss $L_p$ (in green) and joint loss (in orange)
on the right y-axis. All losses are normalized and the shadowed areas in different
colors indicate error-bars of the corresponding curves. Figures (a)-(d) show the
performance with loss coefficient $\alpha$ = 0, 0.25, 0.5 and 1, respectively.
(Color figure online)

Figure 2(a) shows the $\alpha = 0$ case, for which $\mathcal{L} = L_p$ according to
Equation (3). In other words, the algorithm aims to optimize the significance
loss $L_p$ (green curve) rather than the stability loss $L_s$ (blue curve). As a
result, the orange and green curves overlap with each other, and the stability
scores are clearly lower than that of $L_s$. A consistent trend can be observed
across all four cases: the error-bar areas of $L_s$ (blue shadowed areas) shrink
as the weight of $L_s$ increases in the joint loss. A similar observation can be
made in Fig. 2(d), where $\alpha = 1$ and $\mathcal{L} = L_s$: the error-bar area
of $L_p$ (green shadowed area) is considerably bigger than those in the other
$\alpha$ cases. Note that $L_s$ and $\mathcal{L}$ also overlap with each other, and
the mismatch in the figure is caused by the different scales of the left and
right y-axes. When $\alpha = 0.5$ (Fig. 2(c)), clustering stability converges
within a few BO steps (around 6 steps in the orange curve), showing the advantage
of the proposed BO-integrated method in hyper-parameter optimization.


3.3 Statistical Analysis and Independent Test 


Upon convergence of BO, we acquire well-trained FAE encoders to extract features
from modalities, a well-trained clustering model for tumor sub-region
segmentation, and a population-level grouping model to divide patients into
high-risk and low-risk subgroups. Eventually, we acquire 5 tumor sub-regions
$\{P_n\}_{n=1}^{N}$ from features processed by the well-trained FAE, where
$P_n = \{p_i\}$ and $p_i \in \{1,2,3,4,5\}$ denotes the sub-region label of each
pixel, and produce features $\{F_n\}_{n=1}^{N}$, where
$F_n \in \mathbb{R}^{11 \times 1}$ represents 9 spatial features and the proportions
of the 2 significant sub-regions; details of the clinical features can be found
in Appendix 5.2. Subsequently, we apply these well-trained models to the test set
with 35 patients. The results of KM analysis are shown in Fig. 3, illustrating that
the spatial features extracted from tumor sub-regions could lead to patient-level
clustering that successfully separates patients into distinct survival groups in
both datasets (Train: p-value = 0.013; Test: p-value = 0.0034). Figure 4 shows two
case examples from the high-risk and low-risk subgroups, respectively, where
different colours indicate the partitioned sub-regions. Intuitively, these
sub-regions are in line with the prior knowledge of proliferating, necrotic, and
edema tumor areas, respectively.


[Figure: KM survival curves for the low-risk and high-risk groups, plotted against Timeline (x-axis). (a) Train set $\Omega$ = 82 patients, p-value = 0.013. (b) Test set $\Omega_{test}$ = 35 patients, p-value = 0.0034.]


Fig. 3. KM survival curves for the train and test datasets. 


(a) low-risk (CE) (b) low-risk (NE) (c) high-risk (CE) (d) high-risk (NE) 


Fig. 4. Two case examples from the high-risk (a & b) and low-risk (c & d) groups,
respectively. Different colours denote the partitioned sub-regions. The two patients
have significantly different proportions of sub-regions with clinical relevance, which
could provide clinical decision support. (Color figure online)


4 Conclusions 


The paper is an interdisciplinary work that helps clinical research to acquire
robust and effective sub-regions of glioblastoma for clinical decision support.
The proposed FAE architecture significantly enhances the robustness of the
clustering model and improves the quality of clustering results. Additionally,
robust and reliable clustering solutions can be accomplished with minimal time
investment by integrating the entire process inside a BO framework and presenting
a unique loss function for problem-specific multi-task optimization. Finally, the
independent validation of our methodology using a different dataset strengthens
its viability in clinical applications.

Although we have conducted numerous repeated trials, it is impossible to fully
eliminate the randomness in clustering algorithm experiments. In future work,
we could include more modalities and datasets to test the framework. To enhance
the clinical relevance, more clinical variables could be included in the BO
framework for multi-task optimization. To summarise, the BO framework combined
with the suggested FAE and mixed loss represents a robust framework for
obtaining clustering results that are clinically relevant and generalizable across
datasets.


5 Appendix 


5.1 Details of Dataset and Image Processing


Patients with surgical resection (July 2010-August 2015) were consecutively
recruited, with data prospectively collected by the multidisciplinary team (MDT)
central review. All glioblastoma patients underwent pre-operative 3D MPRAGE
(pre-contrast T1 and T1C), T2-weighted FLAIR, pMRI and dMRI sequences.
All patients had a radiological diagnosis of de novo glioblastoma, were aged 18
to 75, and were eligible for craniotomy and radiotherapy. All images were
resampled to a resolution of 1 × 1 × 1 mm³.

Co-registration of the images was accomplished using the linear registration
tool (FLIRT) included in the Oxford Centre for Functional MRI of the Brain
Software Library (FSL) v5.0.0 (Oxford, UK) [5,23]. NordicICE (NordicNeuroLab)
was used to process dynamic susceptibility contrast (DSC) images, DSC being one
of the most frequently utilised perfusion methods. The arterial input function
was automatically defined. The diffusion toolbox in FSL was used to process the
diffusion images (DTI) [1]. The isotropic (p) and anisotropic (q) components were
computed after normalisation and eddy current correction [20].


5.2 Details for Clinical Features 


In this study, through the BO, the tumors were divided into 5 sub-regions
$\{P_n\}_{n=1}^{N}$ from $\{Z_{m'}\}_{m'=1}^{M}$, the features processed by the
well-trained FAE, where $P_n = \{p_i\}$ and $p_i \in \{1,2,3,4,5\}$ denotes the
sub-region label of each pixel. Rather than representing the numerical grey value
of images, the value of each $p_i$ represents a sub-region label, rendering the
majority of features in the GLCM and GLRLM families invalid. Table 2 summarises
the selected features which remain meaningful for the label matrix. Eventually,
the clinical features $\{F_n\}_{n=1}^{N}$, where $F_n \in \mathbb{R}^{11 \times 1}$,
include the 9 spatial characteristics in Table 2 and the fractions of the 2
significant sub-regions.

Table 2. Clinical features from the GLCM matrix of size $N_g \times N_g$ and the
GLRLM matrix of size $N_g \times N_r$, including Relative mutual information (RMI),
Entropy, Joint Energy, Informational Measure of Correlation (IMC), Long Run
Emphasis (LRE), Short Run Emphasis (SRE), Run Variance (RV) and Run Entropy (RE).
$p(i,j|\theta)$ in the formula column describes the probability of the (i,j)th
elements of the matrices along angle $\theta$, and
$\mu = \sum_{i=1}^{N_g}\sum_{j=1}^{N_r} p(i,j|\theta)\,j$ denotes the average run
length of the GLRLM matrix [15].


Feature name | Formula | Interpretation

RMI | $\frac{-\sum_{j=1}^{N_g} p_y(j)\log_2(p_y(j)+\epsilon) + \sum_{i=1}^{N_g}\sum_{j=1}^{N_g} p(i,j)\log_2 p(i|j)}{-\sum_{j=1}^{N_g} p_y(j)\log_2(p_y(j)+\epsilon)}$ | Uncertainty coefficient in landscape pattern [16]

Entropy | $-\sum_{i=1}^{N_g} p(i)\log_2(p(i)+\epsilon)$ | The uncertainty/randomness in the image values

Joint Energy | $\sum_{i=1}^{N_g}\sum_{j=1}^{N_g} (p(i,j))^2$ | Energy is a measure of homogeneous patterns in the image

IMC | $\frac{HXY - HXY1}{\max\{HX, HY\}}$ | Quantifying the complexity of the texture

LRE | $\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r} P(i,j|\theta)\,j^2}{N_r(\theta)}$ | LRE is a measure of the distribution of long run lengths

SRE | $\frac{\sum_{i=1}^{N_g}\sum_{j=1}^{N_r} P(i,j|\theta)/j^2}{N_r(\theta)}$ | SRE is a measure of the distribution of short run lengths

Non-uniformity | $\frac{\sum_{i=1}^{N_g}\left(\sum_{j=1}^{N_r} P(i,j|\theta)\right)^2}{N_r(\theta)^2}$ | Measures the similarity of gray-level intensity values in the image

RV | $\sum_{i=1}^{N_g}\sum_{j=1}^{N_r} p(i,j|\theta)(j-\mu)^2$ | Measure of the variance in runs for the run lengths

RE | $-\sum_{i=1}^{N_g}\sum_{j=1}^{N_r} p(i,j|\theta)\log_2(p(i,j|\theta)+\epsilon)$ | Measures the uncertainty/randomness in the distribution of run lengths
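
As an illustration of how run-length features apply to a label map rather than a grey-value image, the sketch below counts horizontal runs of identical sub-region labels and evaluates LRE and SRE using the commonly used definitions (sum of counts weighted by $j^2$ or $1/j^2$, divided by the number of runs). It handles a single direction only, and the function names and the toy label map are hypothetical.

```python
def run_length_matrix(label_image):
    """Count horizontal (0 degree) runs: rlm[(label, run_length)] = occurrences."""
    rlm = {}
    for row in label_image:
        start = 0
        for k in range(1, len(row) + 1):
            # A run ends at the end of the row or when the label changes.
            if k == len(row) or row[k] != row[start]:
                key = (row[start], k - start)
                rlm[key] = rlm.get(key, 0) + 1
                start = k
    return rlm


def long_run_emphasis(rlm):
    """Weights each run by the square of its length, so long runs dominate."""
    n_runs = sum(rlm.values())
    return sum(count * length ** 2 for (_, length), count in rlm.items()) / n_runs


def short_run_emphasis(rlm):
    """Weights each run by the inverse square of its length."""
    n_runs = sum(rlm.values())
    return sum(count / length ** 2 for (_, length), count in rlm.items()) / n_runs


# Hypothetical 3 x 4 sub-region label map with labels 1 to 3.
labels = [
    [1, 1, 1, 2],
    [2, 2, 3, 3],
    [1, 1, 1, 1],
]
rlm = run_length_matrix(labels)
lre = long_run_emphasis(rlm)
sre = short_run_emphasis(rlm)
```

Because only the equality of neighbouring labels matters, these quantities remain meaningful when the pixel values are categorical sub-region labels, which is the reason given above for retaining them.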
Abstract. Glioma is a common malignant brain tumor with distinct
survival among patients. The isocitrate dehydrogenase (IDH) gene muta- 
tion provides critical diagnostic and prognostic value for glioma. It is 
of crucial significance to non-invasively predict IDH mutation based on 
pre-treatment MRI. Machine learning/deep learning models show rea- 
sonable performance in predicting IDH mutation using MRI. However, 
most models neglect the systematic brain alterations caused by tumor 
invasion, where widespread infiltration along white matter tracts is a 
hallmark of glioma. Structural brain network provides an effective tool 
to characterize brain organisation, which could be captured by the graph 
neural networks (GNN) to more accurately predict IDH mutation. 

Here we propose a method to predict IDH mutation using GNN, 
based on the structural brain network of patients. Specifically, we firstly 
construct a network template of healthy subjects, consisting of atlases 
of edges (white matter tracts) and nodes (cortical/subcortical brain 
regions) to provide regions of interest (ROIs). Next, we employ autoen- 
coders to extract the latent multi-modal MRI features from the ROIs 
of edges and nodes in patients, to train a GNN architecture for pre- 
dicting IDH mutation. The results show that the proposed method out- 
performs the baseline models using the 3D-CNN and 3D-DenseNet. In 
addition, model interpretation suggests its ability to identify the tracts 
infiltrated by tumor, corresponding to clinical prior knowledge. In conclu- 
sion, integrating brain networks with GNN offers a new avenue to study 
brain lesions using computational neuroscience and computer vision 
approaches.

1 Introduction


1.1 Significance of Predicting IDH Mutational Status 


Gliomas are common malignant brain tumors with varying prognoses [16]. The
mutation status of isocitrate dehydrogenase (IDH) genes is one of the most 
important biomarkers for the diagnosis and prognosis of gliomas, where IDH 
mutants tend to have a better prognosis than IDH wild-types [29]. Due to the 
crucial value in clinical practice, IDH mutations have been established as one of 
the landmark molecular markers for glioma patients, recommended by the World 
Health Organization classification of tumors of the Central Nervous System for 
routine assessment in glioma patients [13]. 

Currently, the most widely used approaches to determine IDH mutation status,
i.e., immunohistochemistry and targeted gene sequencing, rely on tumor
samples [13] and therefore cannot be applied to patients who are not
suitable for tumor resection or biopsy. Further, as these assays are usually
time-consuming and expensive, they are not available in some institutions.

Meanwhile, the radiogenomic approach has shown promise in predicting 
molecular markers based on radiological images. Mounting evidence has sup- 
ported the feasibility of predicting IDH mutation status using the pre-operative 
MRI [4,6,11]. The most commonly used MRI sequences include pre-contrast 
T1, post-contrast T1, T2, and T2-weighted-Fluid-Attenuated Inversion Recov- 
ery (FLAIR). Integrating the quantitative information from multi-modal MRI 
promises to provide a non-invasive approach to characterize glioma and predict 
IDH mutations for better treatment planning and prognostication [9,10]. 


1.2 Brain Structural Networks 


The tissue of the human brain is divided into grey matter and white
matter. The grey matter, located on the brain surface, constitutes the cerebral
cortex and can be parcellated into cortical/subcortical regions based on cortical
gyri and sulci. The parcellation offers a more precise association between brain
function and cortical structure. The white matter beneath the cerebral cortex
contains the axons connecting the cortical/subcortical regions. The structural
network of the brain is a mathematical simplification of the connectivity of the
cortical/subcortical regions [3], where the nodes represent the cortical/subcortical
regions and the edges are defined as the connecting white matter tracts.
Accumulating research has reported the significance of structural brain networks
in neuropsychiatric diseases, including stroke, traumatic brain injury, and brain
tumors [5,12,19,27]. In addition, evidence shows that glioma cells tend to
invade along the white matter pathways [26] and infiltrate the whole brain [24,27].
Therefore, structural brain networks could offer a tool to investigate
glioma invasion in both the tumor core and normal-appearing brain regions. Further,
a previous study revealed that IDH mutations could be associated with different 
invasive phenotypes of glioma [18]. To this end, we hypothesize that employing 
the structural brain networks could provide value for predicting IDH mutation 
status. In particular, with prior knowledge of brain structure and anatomy incor- 
porated, a more robust prediction model could be achieved. 


1.3 Graph Neural Networks 


Graph neural networks (GNNs) are a branch of deep learning specialized
in data with irregular structures, such as varying numbers of edges and
arbitrary orderings of nodes in graph data [14]. Unlike traditional convolutional
neural networks (CNNs), which convolve over regularly arranged elements of grid
data, a GNN aggregates information into each node from its neighbors and
simultaneously learns a representation of the whole graph. By employing a GNN on
structural brain networks, the topological information contained in these
networks could be effectively explored, which would consequently incorporate
the prior knowledge of brain organization and capture the critical information
of tumor invasion at the whole-brain level.


1.4 Related Work 


Current methods of predicting IDH mutation status are radiomics/machine
learning-based, deep learning-based, or a combination of both. Radiomics/
machine learning-based methods extract high-dimensional handcrafted features
from the MRIs, e.g., tumor intensity, shape, and texture, to train machine learning
prediction models of molecular markers, tumor grades, or patient survival [6]. Deep
learning-based approaches provide end-to-end models without pre-defined imaging
features in the prediction tasks [11]. Some other methods integrate radiomic
features into a deep neural network to enhance prediction performance [4]. Despite
reasonable prediction accuracy, most of these methods are mainly driven by
computer vision tasks, without considering the systematic alteration of the brain
organization during tumor invasion. Incorporating prior knowledge from the
neuroscience field shows promise for improving the prediction model.


1.5 Proposed Methods 


Here we propose an approach of using GNN to predict IDH mutation status,
based on the structural brain networks generated from multi-modal MRI and
prior human brain atlases. Our contributions include:


- A method to incorporate the prior knowledge of brain atlases with the
anatomical MRI to generate structural brain networks.

- A novel GNN architecture with a specialized graph convolutional operator
for aggregating multi-dimensional latent features of the multi-modal MRI.

- To the best of our knowledge, this is the first study that leverages GNN on
multi-modal MRI to predict the IDH mutation status of glioma.
2 Methods


2.1 Datasets 


This study included the pre-operative multi-modal MRI (pre-contrast T1,
post-contrast T1, T2, and FLAIR) of 389 glioma patients. MRI images of 274 patients
were downloaded from The Cancer Imaging Archive (TCIA) website [17,20,21],
whereas 115 patients were available from an in-house cohort. Seventeen of the
389 patients were excluded because of missing IDH mutation status or incomplete
MRI modalities. Of the 372 included patients, 103 are IDH mutant and 269
are IDH wild-type.


2.2 Imaging Pre-processing 


We processed the multi-modal MRI following a standard pipeline [2]. First,
the pre-contrast T1, T2, and FLAIR images were co-registered to the post-contrast
T1 using the FMRIB’s Linear Image Registration Tool (FLIRT) [8]. Then, brain
extraction was performed on all MRI modalities to remove the skull using the
Brain Extraction Tool in the FMRIB Software Library (FSL) [7,22]. We also
performed histogram matching [15] and voxel smoothing with SUSAN noise
reduction [23]. A neurosurgeon and a researcher performed manual correction of
the brain masks, cross-checked using the Dice score. Finally, all modalities were
non-linearly co-registered using the Advanced Normalization Tools (ANTs) [1] to
the MNI152 standard space, i.e., MNI-152-T1-2MM-brain provided by the FSL (Fig. 1A).


2.3 Constructing Patient Structural Brain Networks 


Brain Network Template. We leveraged the brain network template derived 
from healthy subjects to construct brain networks in lesioned brains [19]. First, 
we used the prior brain atlases in healthy subjects as the template of brain 
networks, generating regions of interest (ROIs) for characterizing the brain net- 
works in patients based on multi-modal MRI. Specifically, we used the Auto- 
mated Anatomical Labelling (AAL) atlas [25] as the node ROIs (Fig. 1B), which 
includes 90 brain cortical and subcortical regions. Further, we generated a brain 
connectivity atlas from ten healthy subjects scanned by high-resolution diffusion 
MRI to derive the edge ROIs of the structural brain networks (Fig. 1C). We
generated the brain connectivity atlas using an approach similar to [5,28]. In
brief, pairwise tractography among the 90 regions of the AAL atlas was first
performed in healthy subjects, and the resultant tract pathways were co-registered
to the MNI152 standard space. Next, the corresponding tracts of all healthy
subjects were averaged for each edge between two nodes. Finally, the top 5% of
voxels by tract density were retained and binarized to generate robust edge ROIs.
The generated edge atlas is shown in Fig. 1C.
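
The averaging and thresholding steps for one edge can be sketched as follows. This is only an illustration under stated assumptions: the function name `edge_roi`, the flattened-list input layout, and the reading of "top 5%" as the top 5% of voxels with nonzero mean density are all hypothetical and not taken from the original pipeline.

```python
def edge_roi(density_maps, keep_fraction=0.05):
    """Build a binary edge ROI from per-subject tract-density maps.

    density_maps: one flattened density map per healthy subject, all of the
    same length and already in a common space (hypothetical input layout).
    """
    n_subjects = len(density_maps)
    n_voxels = len(density_maps[0])
    # Average the tract density across subjects, voxel by voxel.
    mean = [sum(m[v] for m in density_maps) / n_subjects for v in range(n_voxels)]
    # Assumption: "top 5%" is taken among voxels with nonzero mean density.
    nonzero = sorted((d for d in mean if d > 0), reverse=True)
    if not nonzero:
        return [0] * n_voxels
    n_keep = max(1, round(keep_fraction * len(nonzero)))
    threshold = nonzero[n_keep - 1]
    # Binarize: 1 inside the retained high-density core, 0 elsewhere.
    # Ties at the threshold are all kept, so slightly more voxels may survive.
    return [1 if d >= threshold else 0 for d in mean]
```

In practice the maps would be 3D volumes in MNI152 space; flattening them does not change the thresholding logic.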
Fig. 1. Study workflow. Upper: the pipeline of constructing patient brain networks. A:
Patient MRIs are pre-processed and co-registered to the atlas space. B: The AAL atlas
of 90 ROIs is used as the node atlas. C: The edge atlas is generated by performing
tractography among the 90 ROIs on the diffusion MRI of healthy subjects. D & E:
Multi-modal MRI voxels within the node/edge ROIs are extracted and concatenated
into voxel vectors to characterize the nodes/edges. The 90 nodes are from the AAL
atlas, while the 2309 edges are those that exist in 9 of 10 healthy subjects in
tractography. F & G: Two autoencoders are trained using edge and node voxel vectors.
H & I: Encoders of the trained autoencoders are used to extract the low-dimensional
latent features z from the high-dimensional node/edge voxel vectors, respectively.
J: Latent node/edge features are then rearranged into graph format as the input of
the GNN. K: The graph convolutional neural network consists of three hidden graph
convolutional layers, one graph embedding layer, and two fully-connected (FC) layers.


Latent Features of Nodes and Edges from Autoencoders. MRI voxels 
within the ROIs of the node or edge atlases across the whole brain were extracted 
and then concatenated to voxel vectors (Fig. 1D & E). We then used two autoen- 
coders to extract the latent features from the voxel vectors of node and edge, 
respectively. The vector size was set to 2500 (voxels) × 4 (modalities) = 10000. For
edges and nodes with fewer voxels, the vectors were padded with zeros. The
patient cohort was shuffled and split in an 80:20 ratio into training and testing
data. The two autoencoders were trained on the edge and node voxel vectors of the
training data (Fig. 1F & G). Finally, the latent features of edge or node voxels were
derived, with the dimension of the edge or node vectors substantially decreased
from 10000 to 12 (Fig. 1H & I). The 12 latent features were used as the input of
the GNN (Fig. 1J). The logistic sigmoid function was applied as the transfer function
for both the encoder and the decoder. L2 regularization with a coefficient of 0.001
was used, and ‘msesparse’ was set as the loss function.
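
The fixed-size voxel vectors that feed the autoencoders can be sketched as below. The constants follow the sizes stated above (2500 voxels, 4 modalities); the function name `roi_voxel_vector`, the per-modality padding layout, and the choice to reject ROIs larger than 2500 voxels are assumptions made for illustration, since the text does not specify them.

```python
VOXELS_PER_ROI = 2500   # fixed number of voxel slots per modality (from the text)
N_MODALITIES = 4        # pre-contrast T1, post-contrast T1, T2, FLAIR


def roi_voxel_vector(modality_voxels):
    """Concatenate the voxels of one ROI across modalities into a 10000-vector.

    modality_voxels: list of 4 lists, one per modality, holding the intensities
    of the voxels inside the ROI (hypothetical input layout).
    """
    if len(modality_voxels) != N_MODALITIES:
        raise ValueError("expected one voxel list per modality")
    vector = []
    for voxels in modality_voxels:
        if len(voxels) > VOXELS_PER_ROI:
            # Assumption: how oversized ROIs were handled is not stated.
            raise ValueError("ROI exceeds the fixed vector size")
        vector.extend(voxels)
        # Zero-pad small ROIs up to the fixed size.
        vector.extend([0.0] * (VOXELS_PER_ROI - len(voxels)))
    return vector
```

Each such vector would then be compressed by the corresponding encoder to the 12 latent features used by the GNN.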


2.4 Predicting IDH Mutation Status Using GNN 


The patient brain networks constructed above were used to train the GNN, with
the multi-modal MRI latent features as inputs. In addition to the 80:20 split into
training and testing data, the training data was split again in an 80:20 ratio for
cross-validation. The proposed GNN consists of three graph convolutional layers
similar to the one defined in [14], one node-to-graph embedding layer, and two
fully connected feed-forward layers (Fig. 1K). We used a binary cross-entropy
loss and the Adam optimizer.
The graph convolutional operator is defined as follows:


$$x_i' = \Theta_1 x_i + \sum_{z=1}^{Z} \Theta_2^{z} \sum_{j \in N(i)} e_{j,i,z} \cdot x_j \qquad (1)$$


where $x_i'$ denotes the features of node $i$ after convolution, $\Theta_1$ and
$\Theta_2^{z}$ denote the trainable network weights, $\cdot$ is the multiply
operator, $e_{j,i,z}$ represents the $z$th edge feature from source node $j$ to
target node $i$, $j \in N(i)$ denotes all indices of nodes $j$ connecting to node
$i$ with nonzero edge features, and $Z$ denotes the size of latent edge features.
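
A minimal pure-Python sketch of the operator in Eq. (1) is given here to make the two nested sums concrete. It is not the authors' implementation: the function names, the dictionary-based edge storage, and the weight shapes (each weight is a matrix of size F_out x F) are assumptions chosen for readability.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


def graph_conv(x, edges, theta1, theta2):
    """Apply Eq. (1) to every node.

    x:      list of N node feature vectors (length F each)
    edges:  dict mapping (j, i) -> list of Z edge features e[j, i, z],
            with source node j and target node i
    theta1: F_out x F weight matrix
    theta2: list of Z weight matrices, each F_out x F
    """
    out = []
    for i in range(len(x)):
        h = matvec(theta1, x[i])                      # Theta_1 x_i
        for z, theta_z in enumerate(theta2):
            agg = [0.0] * len(x[i])
            for (j, target), e in edges.items():
                if target == i:                       # j in N(i)
                    agg = add(agg, [e[z] * v for v in x[j]])
            h = add(h, matvec(theta_z, agg))          # Theta_2^z * sum_j e * x_j
        out.append(h)
    return out
```

For an undirected structural network, both (j, i) and (i, j) would be stored so that information flows in both directions.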

The graph embedding operator is defined as follows:


$$G^{Z} = \sum_{i=1}^{N} \Theta x_i \qquad (2)$$


where $G^{Z}$ denotes the graph embedding of size $Z$ over all nodes of the graph,
$\Theta$ denotes the trainable network weights, $N$ denotes the number of nodes in
the graph, and $Z$ denotes the size of the latent node features.
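
The embedding in Eq. (2) is a weighted sum over nodes, which can be sketched in a few lines. The function names and the list-of-rows weight layout are illustrative assumptions rather than details from the original work.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def graph_embedding(x, theta):
    """Sum the transformed features of all nodes into one graph-level vector.

    x:     list of N node feature vectors
    theta: Z x F weight matrix shared by all nodes
    """
    g = [0.0] * len(theta)
    for x_i in x:
        g = [a + b for a, b in zip(g, matvec(theta, x_i))]
    return g
```

Because the same weights are applied to every node and the results are summed, the embedding does not depend on the order in which the nodes are listed.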

Random edge drop was applied to augment data during training. A
weighted loss was applied in the network to mitigate the effect of data imbalance.
Learning rate decay was used to stabilize the training process. An early stopping
mechanism, weight decay, and dropout layers after the fully connected layers were
used to prevent over-fitting.
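
One common way to weight a binary cross-entropy loss against class imbalance is sketched below. The exact weighting scheme used in the study is not stated, so the inverse-frequency weights and the names `class_weights` and `weighted_bce` are assumptions for illustration only.

```python
import math


def class_weights(n_pos, n_neg):
    """Inverse-frequency weights so both classes contribute equally overall."""
    total = n_pos + n_neg
    return total / (2 * n_pos), total / (2 * n_neg)


def weighted_bce(probs, labels, w_pos, w_neg, eps=1e-7):
    """Mean binary cross-entropy with a separate weight per class."""
    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        loss -= w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p)
    return loss / len(probs)


# Example with the cohort counts reported in Sect. 2.1 (103 mutant, 269 wild-type);
# in practice the weights would be computed from the training split only.
w_mutant, w_wildtype = class_weights(103, 269)
```

With these counts, errors on the minority IDH-mutant class are penalized more heavily than errors on the wild-type class.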


2.5 Benchmark Models 


We adopted a three-dimensional Densely Connected Convolutional Network
(3D-DenseNet) (Fig. 2A) and a three-dimensional convolutional neural network
(3D-CNN) (Fig. 2B) as the benchmarks. Specifically, a classic 121-layer version
of 3D-DenseNet follows the architecture described in [11], while a traditional
3D-CNN with four hidden convolutional layers with batch normalization and pooling
was applied, followed by a max-pooling layer and an output layer. Data were split
using the same method as for the GNN model. Weighted loss, learning rate decay,
and early stopping were all applied, similar to the GNN settings. The
same loss function and optimiser were applied to the benchmark models as to the
GNN model. Two experiments with different inputs were conducted: whole-brain
MRI and MRI voxels inside tumor ROIs (contrast-enhancing tumor core and
necrosis), which were generated according to [2].




Fig. 2. Architecture of the benchmark models. A. Classic three-dimensional Densely 
Connected Convolutional Networks (3D-DenseNet) consist of four convolutional layers 
and four densely connected blocks. B. Three-dimensional convolutional neural networks 
(3D-CNN) consist of four hidden convolutional layers with max-pooling and batch 
normalization, one global pooling layer followed by dropout, and one fully connected 
dense layer. 


3 Results and Discussion 


3.1 Model Performance 


Our experiments show that the proposed model performs better than the base- 
line models (Table 1) for both cross-validation and testing. Interestingly, the 
benchmark models with tumor voxels as inputs perform better than the models 
with the whole brain as inputs, which suggests the potential bias from the exten- 
sive brain regions beyond the local tumor. Of note, our proposed GNN model, 
leveraging the brain network generated based on prior atlas and whole brain
MRI, performs better than all the benchmark models, which may suggest that 
incorporating prior knowledge of brain networks could help the deep learning 
models capture more informative features regarding tumor invasion over either 
the local tumor or the whole brain. 


Table 1. Performances of IDH prediction models 


Methods | Accuracy (%) | Sensitivity (%) | Specificity (%)
Cross-validation
3D-CNN + whole brain MRI | 69.1 | 61.2 | 72.1
3D-CNN + tumor ROIs | 80.1 | 77.7 | 81.0
3D-DenseNet + whole brain MRI | 76.1 | 67.0 | 79.6
3D-DenseNet + tumor ROIs | 84.1 | 86.4 | 83.3
GNN + brain networks | 87.9 | 97.4 | 88.1
Test
3D-CNN + whole brain MRI | 67.2 | 63.1 | 68.8
3D-CNN + tumor ROIs | 78.2 | 75.7 | 79.2
3D-DenseNet + whole brain MRI | 73.1 | 63.1 | 77.0
3D-DenseNet + tumor ROIs | 83.3 | 83.5 | 83.2
GNN + brain networks | 86.6 | 87.7 | 86.3
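
For reference, the three metrics reported in Table 1 follow from the confusion matrix as sketched below. The function name is hypothetical, the counts in the example are made up for illustration and are not results from the study, and treating IDH mutant as the positive class is an assumption.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts.

    tp/fn: positive cases (assumed here: IDH mutant) predicted right/wrong
    tn/fp: negative cases (assumed here: IDH wild-type) predicted right/wrong
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical counts, for illustration only.
acc, sens, spec = classification_metrics(tp=18, fn=3, tn=46, fp=8)
```

Accuracy is a prevalence-weighted average of sensitivity and specificity, so it always lies between the two.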


3.2 Model Interpretation 


To interpret the learning process of the GNN model, we applied the
GNNExplainer [30]. GNNExplainer outputs a probability score that indicates the
importance of each edge in the prediction task and outputs a compact subnetwork of
the network. This is achieved by maximizing the mutual information between the
GNN’s prediction and the distribution of possible subnetworks. Only subnetworks
with edges that have probability scores greater than 50% were retained.

Overall, we observe that the IDH wild-type is associated with a wider
distribution of edge invasion, as captured by the GNN model. Figure 3 presents two
typical cases, one IDH mutant and one IDH wild-type, together with the
distribution of key white matter tracts (edges) that are important to the
prediction. In line with our prior knowledge that IDH wild-type generally
causes more widespread invasion, the results of the model interpretation
further support the usefulness of the proposed GNN model.
Fig. 3. Examples of IDH mutant and wild-type. A: IDH mutant. B: IDH wild-type. For
both patients, the left panels indicate the T1-weighted images and the right panels
show the output of GNNExplainer, illustrating the voxel distribution of edges that
have over 50% and 90% probability of importance in IDH mutation prediction. The
tract density of a voxel is defined as the number of tracts crossing the voxel.


4 Conclusion 


In this paper, we propose a method to generate brain networks based on multi- 
modal MRI and predict the IDH mutation status using GNN and the gener- 
ated brain networks. Numerical results demonstrate that the proposed method 
outperforms benchmark methods. In future work, we could use the radiomic 
approach to extract representative features from the node and edge ROIs. Fur- 
thermore, special end-to-end GNN models could be developed to directly take 
the high dimensional multi-modal MRI voxels as inputs. To conclude, combining 
brain networks with GNN promises to serve as a novel powerful tool for deep 
learning model development in radiogenomic studies. 
Abstract. Brain extraction is an indispensable step in neuro-imaging
with a direct impact on downstream analyses. Most such methods have 
been developed for non-pathologically affected brains, and hence tend 
to suffer in performance when applied on brains with pathologies, e.g., 
gliomas, multiple sclerosis, traumatic brain injuries. Deep Learning (DL) 
methodologies for healthcare have shown promising results, but their 
clinical translation has been limited, primarily due to these methods suf- 
fering from i) high computational cost, and ii) specific hardware require- 
ments, e.g., DL acceleration cards. In this study, we explore the potential 
of mathematical optimizations, towards making DL methods amenable 
to application in low resource environments. We focus on both the qual- 
itative and quantitative evaluation of such optimizations on an existing 
DL brain extraction method, designed for pathologically-affected brains 
and agnostic to the input modality. We conduct direct optimizations and 
quantization of the trained model (i.e., prior to inference on new data). 
Our results yield substantial gains, in terms of speedup, latency, through- 
put, and reduction in memory usage, while the segmentation performance 
of the initial and the optimized models remains stable, i.e., as quanti- 
fied by both the Dice Similarity Coefficient and the Hausdorff Distance. 
These findings support post-training optimizations as a promising app- 
roach for enabling the execution of advanced DL methodologies on plain 
commercial-grade CPUs, and hence contributing to their translation in 
limited- and low- resource clinical environments. 


Keywords: Low resource environment - Deep learning - 
1 Introduction


One of the most important first steps in any neuro-imaging analysis pipeline 
is brain extraction, also known as skull-stripping [1,2]. This process removes 
all non-brain portions in a brain scan and leaves the user with the portion of 
the image that is of maximal interest, i.e., the brain tissue and all associated 
pathologies. This step is an indispensable pre-processing operation that has a
direct effect on subsequent analyses, and is also used for de-identification
purposes [3]. Enabling this to run on clinical workstations could have a
tremendously positive impact on automated clinical workflows. The effects of the
quality of brain
extraction in downstream analyses have been previously reported, for studies on 
tumor segmentation [4-6] and neuro-degeneration [7]. 

This study specifically focuses on glioblastoma (GBM), which is the most
aggressive type of adult brain tumor. GBM has a poor prognosis despite current
treatment protocols [8,9], and its treatment and management are often
problematic, necessitating personalized treatment plans. To improve the
treatment customization process, computational imaging and machine learning
based assistance could prove to be highly beneficial. One of the key steps towards
this would be a robust approach to obtain the complete region of immediate
interest, irrespective of the included pathologies, which would result in an
improved computational workflow.

While deep learning (DL) has been showing promising results in the field of 
semantic segmentation in medical imaging [4,10-17], the deployability of such
models poses a substantial challenge, mainly due to their computational foot- 
print. While prior work on brain extraction has focused on stochastic modeling 
approaches [1,2,18], modern solutions leveraging DL have shown great promise 
[12,15]. Unfortunately, models trained for this application also suffer from such 
deployment issues, which in turn reduces their clinical translation. 

In recent years, well-known DL frameworks, such as PyTorch [19] and
TensorFlow [20], have enabled the democratization of DL development by making
the underlying building blocks accessible to the wider community. They usually
require moderately expensive computing hardware with DL acceleration
cards, such as Graphics Processing Units (GPUs) [21] or Tensor Processing
Units (TPUs) [22]. While these frameworks will work on sites with such
computational capacity (i.e., GPUs and TPUs), deploying them to locations with low
resources is a challenge. Most DL-enabled studies are extremely compute
intensive, and the complexity of the pipeline makes them very difficult to deploy,
especially in tightly controlled clinical environments. While cloud-based
solutions could be made available, patient privacy is a major health system concern,
which requires multiple legal quandaries to be addressed prior to uploading data
to the cloud. Consequently, making such approaches available on local, inexpensive
compute solutions would be the sole feasible way for their clinical translation.

Quantizing neural networks can reduce the computational time required for
the forward pass, but more importantly can reduce the memory burden at
inference time. Through quantization, a high precision model is reduced to a
lower bit resolution model, thus reducing the size of the model. The final goal is
to leverage the advantages of quantization and optimization, while maintaining
the segmentation performance of the full precision floating point models as much
as possible. Such methods can facilitate the reduction of the memory required
to store the generated model and run inference with it [23].

In this paper, we take an already published DL method, namely Brain Mask
Generator (BrainMaGe)¹ [15], and make it usable in low resource environments,
such as commercial-grade CPUs with low memory and older generation CPUs,
by leveraging the advantages of quantization and optimization for performance
improvements. We provide a comprehensive evaluation of the observed
performance improvements across multiple CPU configurations and quantization
methods for the publicly available TCGA-GBM dataset [6,24,25], as well as a private
testing dataset.


2 Methods 


2.1 Data 


We identified and collected n = 864 multi-parametric magnetic resonance imaging
(mpMRI) brain tumor scans from n = 216 GBM patients from both private
and public collections. The private collections included n = 364 scans, from
n = 91 patients, acquired at the Hospital of the University of Pennsylvania
(UPenn). The public data is available through The Cancer Imaging Archive
(TCIA) [24] and comprises the pre-operative mpMRI scans of The Cancer
Genome Atlas Glioblastoma (TCGA-GBM, n = 125) [6,25] collection. The final
dataset (Table 1) included n = 864 mpMRI scans from n = 216 subjects with
4 structural modalities available for each subject, namely T1-weighted pre- &
post-contrast (T1 & T1Gd), T2-weighted (T2), and T2 fluid attenuated inversion
recovery (FLAIR). Notably, the multi-institutional data of the TCGA-GBM
collection is highly heterogeneous in terms of scan quality, slice thickness between
different modalities, and scanner parameters. For the private collection data, the
T1 scans were taken with high axial resolution. The brain masks for the private
collection data were generated internally and went through rigorous manual
quality control, while the brain masks for the TCGA-GBM data were provided
through the International Brain Tumor Segmentation (BraTS) challenge [4-6,
26-28].


2.2 Data Pre-processing 


All DICOM scans were converted to the Neuroimaging Informatics Technology
Initiative (NIfTI) [29] file format to facilitate computational analysis, following
the well-accepted pre-processing protocol of the BraTS challenge [4-6, 26-28].
Specifically, all the mpMRI volumes were reoriented to the left-posterior-superior
(LPS) coordinate system, and the T1Gd scan of each patient was rigidly (6
degrees of freedom) registered and resampled to an isotropic resolution of 1 mm³


1 https://github.com/CBICA/BrainMaGe.


154 S. P. Thakur et al. 


Table 1. The distribution of all the datasets used in the study. 


Dataset | No. of subjects | No. of mpMRI scans
TCGA-GBM | 125 | 500
UPenn | 91 | 364
Total | 216 | 864


based on a common anatomical atlas, namely SRI24 [30]. We chose this atlas
[30] as the common anatomical space, following the convention suggested by the
BraTS challenge. The remaining scans (i.e., T1, T2, FLAIR) of each patient were
then rigidly co-registered to this resampled T1Gd scan by first obtaining the rigid
transformation matrix to T1Gd, then combining it with the transformation matrix
from T1Gd to the SRI24 atlas, and resampling. For all the image registrations
we used the “Greedy”² tool [31], which is a central processing unit (CPU)-based
C++ implementation of the greedy diffeomorphic registration algorithm
[32]. Greedy is integrated into the ITK-SNAP³ segmentation software [33,34], as
well as into the Cancer Imaging Phenomics Toolkit (CaPTk)⁴ [35-39]. We further
note that the use of any non-parametric, non-uniform intensity normalization
algorithm [40-42] to correct for intensity non-uniformities caused by the
inhomogeneity of the scanner’s magnetic field during image acquisition obliterates
the T2-FLAIR signal, as has been previously reported [5]. Taking this
into consideration, we intentionally apply the N4 bias field correction approach
[41] to all scans only temporarily, to facilitate an improved registration of all
scans to the common anatomical atlas. Once we obtain the transformation matrices
for all the scans, we apply these transformations to the non-bias corrected
images. This complete pre-processing is available through CaPTk, as officially
used for the BraTS challenge (Fig. 1).
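
Combining the two rigid transformations into a single one, so that each scan is resampled only once, amounts to multiplying their homogeneous matrices. The sketch below illustrates this with 4x4 matrices in plain Python; the helper names and the example transforms are hypothetical, and real registration tools such as Greedy use their own matrix conventions and file formats.

```python
def matmul4(a, b):
    """Product of two 4x4 homogeneous transformation matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def apply_transform(m, point):
    """Map a 3D point through a 4x4 homogeneous transformation."""
    v = (point[0], point[1], point[2], 1)
    return tuple(sum(m[i][k] * v[k] for k in range(4)) for i in range(3))


def translation(tx, ty, tz):
    """Homogeneous matrix for a pure translation."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


# Hypothetical example transforms (not taken from any real registration).
t1_to_t1gd = translation(1, 0, 0)                  # e.g. T1 -> T1Gd
t1gd_to_atlas = [[0, -1, 0, 0], [1, 0, 0, 0],      # e.g. T1Gd -> atlas:
                 [0, 0, 1, 0], [0, 0, 0, 1]]       # 90-degree rotation about z

# The transform applied first goes on the right of the product.
t1_to_atlas = matmul4(t1gd_to_atlas, t1_to_t1gd)
```

Applying `t1_to_atlas` once gives the same result as applying the two transforms in sequence, while avoiding a second interpolation of the image.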


2.3 Network Topology 


We have used the 3D implementation [10] of the widely-used U-Net network
topology [44], with added residual connections between the encoder and the
decoder, to improve the backpropagation process [10,13,15,44-46]. The actual
topology used here is shown in Fig. 2. The U-Net topology has been extensively
used in semantic segmentation of both 2D and 3D medical imaging data.
The U-Net consists of an encoder, which contains convolutional layers and
downsampling layers, and a decoder, which contains upsampling layers (applying
transpose convolution layers) and convolutional layers. The encoder-decoder
structure contributes towards automatically capturing information at varying
resolutions and scales. Skip connections, which concatenate feature maps paired
across the encoder and the decoder layers, are added to improve context


2 github.com/pyushkevich/greedy, hash: 1a871c1, last accessed: 27/May/2020.
3 itksnap.org, version: 3.8.0, last accessed: 27/May/2020.
4 www.cbica.upenn.edu/captk, version: 1.8.1, last accessed: 11/February/2021.
Fig. 1. Example of an MRI brain tumor scan from a randomly selected subject from the
test set. (A) The original scans include the skull and other non-brain tissues, and (B)
the corresponding scan slices depict only the brain.


and feature re-usability. The residual connections utilize additional information 
from previous layers (across the encoder and decoder) that enable a segmentation 
performance boost. 


Fig. 2. The U-Net topology with residual connections from GaNDLF was used for this 
study. Figure was plotted using PlotNeuralNet [43]. 


2.4 Inference Optimizations 


In this work, we used the OpenVINO toolkit (OV) for the optimizations of the
BrainMaGe model. First, in order to provide estimates of the scalability of the
model performance in low resource environments, we compare
the inference performance of the optimized OV model with that of the PyTorch
framework. We further compare the optimized model performance
across various hardware configurations typically found in such environments.
We then showcase further performance improvements obtained through
post-training quantization of the model and perform similar comparisons across
different hardware configurations. In summary, for the BrainMaGe model, we explored
both (i) conversion from PyTorch to the optimized model with an additional
intermediate conversion to ONNX, which led to an intolerable accuracy drop
during the PyTorch to ONNX conversion step, and (ii) direct conversion from
PyTorch to the model’s optimized intermediate representation format.


2.4.1 OpenVINO Toolkit 

OV is a neural network inference optimization toolkit [47], which provides
inference performance optimizations for applications using computer vision,
natural language processing, and recommendation systems, among others. It has
two main components: 1) a model optimizer and 2) an inference engine. The OV
model optimizer converts a pre-trained network model trained
in frameworks (such as PyTorch and TensorFlow) into an intermediate
representation (IR) format that can be consumed by the second main component,
i.e., the inference engine. Other supported formats include the ONNX
format. Hence, for frameworks like TensorFlow and PyTorch, there is an
intermediate conversion step that can be performed offline. While support for direct
conversion from the PyTorch framework is limited, there are specific extensions
[48] that enable this. The OV inference engine provides optimized
implementations for common operations found in neural networks, such as
convolutions and pooling operations. OV also provides graph level optimizations,
such as operator fusion and optimizations for common neural network patterns,
through the nGraph library [49]. These optimizations can provide direct
improvements in the execution time of the model, enabling the latter for low- (or
limited-) resource environments with tight compute constraints.


2.5 Network Quantization 


Quantization is an optimization technique that has been adopted in recent times, 
to improve inference performance of neural network models [50,51]. It involves a 
conversion from a high precision datatype to a lower-precision datatype. In this 
study, we specifically discuss the quantization of a 32-bit floating point (FP32) 
model to an 8-bit integer (INT8) model as provided by Eq. 1: 


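$$\mathrm{Out}_{\mathrm{INT8}} = \mathrm{round}(\mathrm{scale} \cdot \mathrm{In}_{\mathrm{FP32}} + \mathrm{zero}_{\mathrm{offset}}) \qquad (1)$$


where the scale factor provides a mapping of the FP32 values to the low-precision
range. The $\mathrm{zero}_{\mathrm{offset}}$ provides a representation of the FP32
zero value as an integer value [52,53].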
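
Eq. 1 can be illustrated for a single value as follows. The clamping to the signed INT8 range and the inverse mapping are standard companions of this formula but are not spelled out in the text, so they are assumptions here, as are the function names.

```python
def quantize(x, scale, zero_offset, qmin=-128, qmax=127):
    """Map an FP32 value to INT8 following Eq. 1, then clamp to the INT8 range."""
    q = round(scale * x + zero_offset)
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_offset):
    """Approximate inverse of Eq. 1: recover an FP32 estimate from INT8."""
    return (q - zero_offset) / scale
```

The round trip loses at most half a quantization step (0.5 / scale) for values that are not clamped, which is the source of the small accuracy changes that quantization can introduce.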

We have explored leveraging quantization for further improvements in infer- 
ence, while maintaining the model’s segmentation performance. Quantization 
has many benefits, including (i) speedup improvements, and (ii) reduction of 
memory utilization. There are two popular approaches to model quantization, 
namely: 


1. Quantization-aware training [54], which involves training the neural net- 
work with fake quantization operations inserted in the network graph. The 
fake quantization nodes are able to learn the range of the input tensors and 
hence this serves as a simulation of the quantization. 

2. Post-training quantization [55], in which the quantization process is
performed after training, but prior to the actual inference. A subset
of the training dataset is selected for calibration, and this dataset is used to
learn the minimum and maximum ranges of the weights and activations
for tensor quantization.


In this study, we have focused on exploring post-training quantization using 
the OV AccuracyAware technique [56], which provides model optimizations 
while explicitly limiting the segmentation performance drop. The intuition of 
the method is that the quantization is targeted towards all eligible layers in the 
topology. However, if a segmentation performance drop is observed, greater than 
the user-specified threshold, the layers that contribute the most to the segmen- 
tation performance drop are iteratively reverted back to the original datatype, 
until the desired segmentation performance level is achieved. 
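
The intuition described above can be sketched as a simple greedy loop. This is
an illustrative outline only, not the OV AccuracyAware implementation: the
function `accuracy_aware_selection`, the `evaluate` callback, and the toy
per-layer penalties are all hypothetical, and the ranking rule (revert the layer
whose reversion recovers the most performance) is an assumption.

```python
def accuracy_aware_selection(layers, evaluate, max_drop):
    """Return the set of layers that can stay in INT8.

    layers   : iterable of layer names eligible for quantization
    evaluate : callable(frozenset of INT8 layer names) -> metric, higher is better
    max_drop : user-specified tolerated drop relative to the all-FP32 baseline
    """
    baseline = evaluate(frozenset())      # every layer kept in FP32
    quantized = set(layers)               # start by quantizing all eligible layers
    while quantized and baseline - evaluate(frozenset(quantized)) > max_drop:
        # Revert the layer whose removal from the INT8 set helps the most.
        worst = max(sorted(quantized),
                    key=lambda name: evaluate(frozenset(quantized - {name})))
        quantized.remove(worst)
    return quantized


# Toy stand-in for a real validation run: each INT8 layer costs a fixed penalty.
PENALTY = {"conv1": 0.02, "conv2": 0.001, "conv3": 0.0005}


def toy_dice(int8_layers):
    return 0.97 - sum(PENALTY[name] for name in int8_layers)


kept = accuracy_aware_selection(PENALTY, toy_dice, max_drop=0.005)
print(sorted(kept))  # ['conv2', 'conv3']
```

In this toy setting only `conv1` is reverted to FP32, because quantizing it alone
accounts for most of the simulated performance drop.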


2.5.1 Quantitative Evaluation 
The model is quantitatively evaluated according to (i) the Dice Similarity
Coefficient [57] (a widely used and accepted metric for quantifying segmentation
results [58]), (ii) the 95th percentile of the (symmetric) Hausdorff Distance
(commonly used in Biomedical Segmentation challenges), (iii) memory
utilization, and (iv) inference performance (latency). We
further report the model performance for each stage of optimization, i.e., for the 
1) baseline PyTorch implementation, 2) OV optimized FP32 model, and 3) OV 
optimized model converted to INT8 format through the post-training quantiza- 
tion step (Table 4). It is important to note that quantization to lower precision 
formats, such as INT8, typically results in a small drop in segmentation per- 
formance but this is highly dependent on the dataset. In our case, we do not 
notice any loss in segmentation performance after converting the model to the 
OV optimized model format. 


2.6 Experimental Design 


In favor of completeness, we chose five hardware platforms from various CPU
generations to benchmark our model configurations. We ran inference
benchmarks on all five hardware platforms with n = 132 images from the
TCGA-GBM dataset. The reported results are averages over the inference runs
on these images, with a batch size of n = 1. See Tables 2 and 3 for the detailed
hardware and software configurations.


Table 2. The detailed hardware configurations used for our experiments.
Hyper-threading and turbo were enabled for all.


                                  | Config 1 | Config 2 | Config 3 | Config 4 | Config 5
Platform                          | Kaby Lake | Coffee Lake | Ice Lake-U | Tiger Lake | Cascade Lake
CPU                               | Core(TM) i5-7400 CPU @ 3.00 GHz | Core(TM) X-GOLD 626 CPU @ 2.60 GHz | Core(TM) i7-1065G7 CPU @ 1.30 GHz | Core(TM) i7-1185G7 CPU @ 3.00 GHz | Xeon(R) Gold 6252N CPU @ 2.30 GHz
# Nodes, # Sockets                | 1, 1 | 1, 1 | 1, 1 | 1, 1 | 1, 2
Cores/socket, Threads/socket      | 4, 4 | 8, 16 | 4, 8 | 4, 8 | 24, 48
Mem config: type, slots, cap, speed | DDR4, 2, 4GB, 2133 MT/s | DDR4, 2, 8GB, 2667 MT/s | LPDDR4, 2, 4GB, 3733 MT/s | DDR4, 2, 8GB, 3200 MT/s | DDR4, 12, 16GB, 2933 MT/s
Total memory                      | 8GB | 16GB | 8GB | 16GB | 192GB
Advanced technologies             | AVX2 | AVX2 | AVX2, AVX512, DL Boost (VNNI) | AVX2, AVX512, DL Boost (VNNI) | AVX2, AVX512, DL Boost (VNNI)
TDP                               | 90W | 95W | 15W | 28W | 150W


Table 3. Details of the topology implementation. We used the 3D-ResU-Net architec- 
ture with 1 input channel, 2 output classes, and number of initial filters as 16. 


Framework   | OpenVINO 2021.4                | PyTorch 1.5.1, 1.9.0
Libraries   | nGraph/MKLDNN                  | MKLDNN
Model       | Resunet-ma.xml, Resunet-ma.bin | Resunet-ma.pt
Input shape | (1, 1, 128, 128, 128)          | (1, 1, 128, 128, 128)
Precision   | FP32, INT8                     | FP32, INT8


3 Results


Of particular interest are the results obtained using Hardware Configuration
4 (Core(TM) i7-1185G7 @ 3.00 GHz machine), which represents the current
generation of hardware available in the consumer market. We further summarize
the results obtained from all hardware configurations in Fig. 3. Table 4 shows
the summary of these metrics on hardware configuration 4, using the
n = 132 images from the public dataset. We also compare the results obtained
using PyTorch v.1.5.1 and PyTorch v.1.9.0. Notably, the dynamic quantization
methodology on PyTorch v.1.9.0 did not yield any performance improvement.
With FP32 precision, the segmentation performance of the PyTorch and the OV
models is identical. Although memory utilization is slightly better with PyTorch
v.1.9.0, the inference performance (latency) is 1.89x better with OV. When
assessing the INT8 quantized OV model, the segmentation performance drop is
negligible, with comparable memory utilization, but with a 6.2x boost in latency
when compared to PyTorch v.1.9.0. The memory utilization and the segmentation
performance are similar across the hardware configurations, with some variations
in latency. On the client hardware platforms (Configurations 1, 2, 3, and 4), with
OV FP32 precision, we observed up to 2.3x improvements in latency. The OV INT8
precision yielded further speedups of up to 6.9x. On the server hardware platform
(Configuration 5), with OV FP32 precision, we observed up to 9.6x speedup, and
with the INT8 precision we observed a speedup of up to 20.5x. Figure 3 illustrates
the speedup per configuration, and Fig. 4 highlights some example qualitative
results. The additional boost in performance with the INT8 quantized model in
Configurations 3, 4, and 5 is due to the hardware platform's advanced features,
i.e., AVX512 & Intel DL Boost technology [59,60].
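
The speedups relative to PyTorch v.1.9.0 can be cross-checked against the
normalized latency column of Table 4, which is expressed relative to PyTorch
v.1.5.1. The short sketch below simply re-normalizes those table values; the
variable names are illustrative. The small gap between the ratios it prints and
the 1.89x and 6.2x quoted above is presumably due to rounding in the table.

```python
# Normalized latency speedups from Table 4 (reference: PyTorch 1.5.1, FP32).
speedup_vs_pt151 = {
    "PyTorch 1.5.1 FP32": 1.0,
    "PyTorch 1.9.0 FP32": 3.8,
    "OpenVINO 2021.4 FP32": 7.1,
    "OpenVINO 2021.4 INT8": 23.3,
}

# Re-normalize so that PyTorch 1.9.0 becomes the reference.
reference = speedup_vs_pt151["PyTorch 1.9.0 FP32"]
speedup_vs_pt190 = {name: value / reference
                    for name, value in speedup_vs_pt151.items()}

for name, value in speedup_vs_pt190.items():
    print(f"{name}: {value:.2f}x")
```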


Table 4. Summary of accuracy, memory utilization and performance (latency) on the
hardware configuration 4: Core(TM) i7-1185G7 @ 3.00 GHz.


DL framework | Version | Precision | Average Dice score | Average Hausdorff distance | Memory utilization (normalized) | Avg. latency speedup (normalized)
PyTorch      | 1.5.1   | FP32      | 0.97198            | 2.6577 ± 3.0               | 1                               | 1
PyTorch      | 1.9.0   | FP32      | 0.97198            | 2.6577 ± 3.0               | 0.769                           | 3.8
OpenVINO     | 2021.4  | FP32      | 0.97198            | 2.6577 ± 3.0               | 1.285                           | 7.1
OpenVINO     | 2021.4  | INT8      | 0.97118            | 2.7426 ± 3.1               | 0.907                           | 23.3


Fig. 3. Speedup across different platforms using all the cores available on a processor.


3.1 Core Scaling Improvements Across Various CPUs 


Additionally, we performed core scaling benchmarks to determine the scalability
of the model and the hardware. We benchmarked all the hardware configurations
while limiting the number of threads used to run the inference. Figure 5 shows
a trend of increased performance with the increase in the number of threads.
A slight drop in speedup can be observed if the number of threads assigned is
greater than the number of physical cores. This is due to the imbalance and
over-subscription of the threads. When varying the number of threads for
inference, the memory utilization and accuracy are similar to those obtained
when running on all the available threads. The performance of both
the PyTorch and the OV models improved with the increase in the number
of threads allocated to the inference. However, the speedup achieved with the
OV optimized FP32 and INT8 models, over PyTorch, is substantial and can be
observed on all hardware configurations. Figure 5f shows the average inference
time speedup achieved by limiting the number of threads on different hardware
configurations.


4 Discussion 


In this study, we investigated the potential contributions of mathematical
optimizations of an already trained Deep Learning (DL) segmentation model, to
enable its application in limited-/low-resource environments. We specifically
focused on an MRI modality agnostic DL method, explicitly designed and developed
for the problem of brain extraction in the presence of diffuse gliomas [14, 15].
We explored these mathematical optimizations, in terms of their potential model
improvements on 1) execution time, for different hardware configurations (i.e.,
speedup, Fig. 3), 2) speedup, as a function of increasing number of CPU cores
for all the hardware configurations we considered (Fig. 5), 3) memory
requirements (Table 4), and 4) segmentation performance. Our results yield a
distinct speedup, and a reduction in computational requirements, while the
segmentation performance remains stable, thereby supporting the potential of the
proposed solution for application in limited-/low-resource environments.

Fig. 4. Qualitative comparison of results for one of the subjects with high resolution
T1 scans across the 3 visualization slices. “GT” is the ground truth mask, “PT-FP32”
is the mask generated by the original PyTorch FP32 model, “OV-FP32” is the output
of the optimized model in FP32, and “OV-INT8” is the output of the optimized model
after quantizing to INT8.

Fig. 5. Core scaling performance improvements, across various hardware
configurations, shown in (a-e). The average speedup across all hardware
configurations, and comparison with the PyTorch baseline performance (f).

For these intended inference time optimizations (i.e., applied in the already
trained model), we have particularly focused on using the post-training
quantization technique. We observe that the largest improvement in terms of speedup was
obtained from the post-training quantized INT8 model, which ended up being
> 23x faster than the native single-precision implementations, while producing a
negligible segmentation performance drop as measured by both the Dice Similarity
Coefficient and the Hausdorff distance (Table 4). Post-training quantization
is the quickest method of obtaining the quantized INT8 model and is desirable in
situations where the “accuracy” (i.e., segmentation performance) drop is minimal
and within an acceptable threshold. In scenarios where the “accuracy”
drop is greater than the acceptable threshold, quantization-aware training could
be an alternative approach to help in obtaining such potential improvements.
However, such optimization (quantization-aware training) would require model
re-training.

The total number of parameters of the BrainMaGe 3D-ResU-Net model is
8.288 x 10^6, for which the number of Floating point operations per second (Flops)
required for the OV FP32 model is 350.72665 x 10^9, whereas for the OV INT8
model the number of Flops required is 2.09099 x 10^9 and the number of Integer
operations per second (Iops) required is 348.63566 x 10^9. We observed that
approximately 99.4% of Flops have been converted to Iops in the optimized
INT8 model, resulting in two major computational benefits: (i) With lower
precision (INT8), there is an improved data transfer speed through the memory
hierarchy due to better cache utilization and reduction of bandwidth
bottlenecks, thus enabling maximal use of the compute resources; (ii) With
hardware advanced features [59,60], the number of compute operations per second
(OPS) is higher, thus reducing the total compute time. These two benefits of
reduced memory bandwidth and higher frequency of OPS with the lower precision
model resulted in substantial improvements (Table 4).
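
The 99.4% figure follows directly from the operation counts quoted above, as
the following short check shows. The variable names are illustrative, and the
values are kept in the common unit used in the text.

```python
# Operation counts quoted in the text, all in the same unit.
flops_fp32 = 350.72665   # OV FP32 model: floating point operations
flops_int8 = 2.09099     # OV INT8 model: remaining floating point operations
iops_int8 = 348.63566    # OV INT8 model: integer operations

# Share of the original floating point work that now runs as integer operations.
converted_fraction = iops_int8 / flops_fp32
print(f"{converted_fraction:.1%}")  # 99.4%
```

The INT8 model's remaining Flops and its Iops also add up to the Flops of the
FP32 model, so the total amount of work is unchanged; only its precision
differs.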

In favor of transparency and reproducibility, we make publicly available the
optimized BrainMaGe brain extraction model, through its original repository⁵.
Furthermore, a more generalized solution will also be made publicly available
through the Generally Nuanced Deep Learning Framework (GaNDLF)⁶ [13],
towards enabling scalable end-to-end clinically-deployable workflows.

We consider the immediate future work as four-fold: 1) performance evaluation
of quantization-aware training compared against post-training quantization;
2) extended evaluation on a larger multi-institutional dataset [61,62], as
well as evaluation of additional network topologies; 3) a comprehensive
analysis covering additional hardware configurations; 4) assessment of the
potential contributions of these mathematical optimizations for varying DL
workloads, beyond segmentation and towards regression and classification tasks in
the healthcare domain.


5 https://github.com/CBICA/BrainMaGe.
6 https://github.com/CBICA/GaNDLF.


Abstract. This paper proposes an adversarial learning based training
approach for the brain tumor segmentation task. In this concept, the 3D
segmentation network learns from dual reciprocal adversarial learning
approaches. To enhance the generalization across the segmentation predictions
and to make the segmentation network robust, we adhere to the
Virtual Adversarial Training approach by generating more adversarial
examples via adding some noise to the original patient data. By incorporating
a critic that acts as a quantitative subjective referee, the segmentation
network learns from the uncertainty information associated with
segmentation results. We trained and evaluated the network architecture on
the RSNA-ASNR-MICCAI BraTS 2021 dataset. Our performance on the
online validation dataset is as follows: Dice Similarity Score of 81.38%,
90.77% and 85.39%; Hausdorff Distance (95%) of 21.83 mm, 5.37 mm and
8.56 mm for the enhancing tumor, whole tumor and tumor core, respectively.
Similarly, our approach achieved a Dice Similarity Score of 84.55%,
90.46% and 85.30%, as well as Hausdorff Distance (95%) of 13.48 mm,
6.32 mm and 16.98 mm on the final test dataset. Overall, our proposed
approach yielded better performance in segmentation accuracy for each
tumor sub-region. Our code implementation is publicly available.


Keywords: Deep learning - Brain tumor segmentation - Medical 
image segmentation - Generative Adversarial Network - Virtual 
Adversarial Training 


1 Introduction 


Segmentation accuracy on boundaries is essential in medical image segmentation,
as it is crucial for many clinical applications, such as treatment planning,
disease diagnosis and image guided intervention, to name a few. Tremendous
progress of deep learning algorithms in dense pixel level prediction tasks has
recently drawn attention to implementing automatic segmentation applications
for brain tumor/glioma segmentation. Gliomas are considered the most common
brain tumor variant in adults. Diagnosing High-Grade Gliomas (HGG), which are
more malignant (since they usually grow fast and frequently destroy healthy
brain tissue), in early phases is essential for treatment planning. On the other
hand, Low-Grade Gliomas (LGG) are slower growing tumors which can be cured
if they are diagnosed in early phases. However, segmenting tumor sub-regions from
various medical image modalities (e.g., MRI and CT) is a monotonous process
which is time consuming and subjective. Medical imaging analysis is carried out
by radiologists, and this manual process is tedious since the volumes are large
and contain heterogeneous, ambiguous sub-regions (i.e., edema, active
tumor structures, necrotic components, and non-enhancing gross abnormality).
In particular, medical image segmentation plays a cornerstone role in computer
aided diagnosis. With the recent development of computer vision algorithms in
deep learning, there have been many advances in automatic medical image
segmentation. The Multi-modal Brain Tumor Segmentation challenge (BraTS) has
been one of the platforms for such advances for many years. During the last decade,
variants of Fully Convolutional Network (FCN) and Convolutional Neural Network
(CNN) based architectures have shown convincing performance in previous
BraTS and other segmentation challenges. Recent volumetric medical image
segmentation networks like 3D-UNet [6] and V-Net [14] have been
widely used with medical image modalities, since these networks produce
predictions for different planes (i.e., axial (divides the body into top and bottom
halves), coronal (perpendicular), and sagittal (midline of the body)).

The main limitation of implementing and training these volumetric neural
network architectures is out-of-memory (OOM) issues, and extending these
architectures is not feasible due to computational resource constraints. Many
researchers have shown that, with carefully crafted pre-processing, training
and inference procedures, the segmentation accuracy of 3D-UNet can improve
further. Considering factors such as OOM issues, resource limitations and
inference time, we propose an approach to tackle these challenges and further improve
the segmentation accuracy and training process of the 3D-UNet architecture [6]. In
summary, our major contributions are:


1. Inspired by adversarial learning techniques, we propose two-way adversarial
learning to segment brain tumor sub-regions in multi-modal MR images.

2. We introduce a volumetric discriminator model which can explicitly show the
confidence towards the current prediction to impose a higher-order consistency
measure of prediction and ground truth during training.

3. We introduce Virtual Adversarial Training (VAT) during model training to
enhance the model’s robustness to data artefacts.


2 Related Work


2.1 Medical Image Segmentation 


Deep Convolutional Neural Networks and U-shaped encoder-decoder architectures
have developed rapidly and have shown convincing performance in medical
image segmentation. The celebrated work U-Net [18] has shown a novel direction
for automatic medical image segmentation, as it exploits both spatial and
contextual information of images, which greatly affects the accuracy of
segmentation models. Due to the simplicity and superior performance of U-Net,
many variants of U-shaped architectures are constantly emerging, such as
Res-UNet [20], H-Dense-UNet [11], U-Net++ [22] and Attention-UNet [16]. Later,
to handle volumetric data, models such as 3D-Unet [6] and V-Net [14] were
introduced into the field of 3D medical image segmentation.


2.2 Adversarial Learning 


Generative Adversarial Networks (GANs) [8] by Goodfellow have been a major
breakthrough in the image generation task. Inspired by the GAN approach, many
GAN based medical imaging applications were introduced recently, including in
the areas of medical image segmentation [12], reconstruction [17] and domain
adaptation [21]. In the BraTS challenge 2020, Marco et al. proposed a 3D
volume-to-volume Generative Adversarial Network for segmentation of brain
tumours [7], where the discriminator is built based on the PatchGAN [9]
architecture style. VAT is another adversarial learning approach which has shown
tremendous performance in semi-supervised learning [15]. VAT is applicable to
any parametric model and it directly regularizes the output distribution by the
local sensitivity of the output with respect to the input [15].

Hence, inspired by the above works, we propose a min-max formulation with VAT
for segmenting brain tumors in multi-modal MR images.


3 Methodology 


We start this section by providing an overview of the BraTS dataset and pro- 
posed method as shown in Fig. 2. Then we detail out the structure of each module 
and the entire training pipeline. 


3.1 Dataset 


The Magnetic Resonance images used for the model training and evaluation are
from the Multi-modal Brain Tumour Segmentation Challenge (BraTS) 2021 [2-5,13].
The BraTS 2021 training dataset contains 1251 MR volumes of shape
240 x 240 x 155. MRI is required to evaluate tumor heterogeneity. These MRI
sequences are conventionally used for glioma detection: T1 weighted sequence
(T1), T1-weighted contrast enhanced sequence using gadolinium contrast agents
(T1Gd, also denoted T1CE), T2 weighted sequence (T2), and Fluid attenuated
inversion recovery (FLAIR) sequence. From these sequences, four distinct tumor
sub-regions can be identified: the Enhancing Tumor (ET), which corresponds to
the area of relative hyper-intensity in the T1CE with respect to the T1 sequence;
the Non Enhancing Tumor (NET) and the Necrotic Tumor (NCR), which are both
hypo-intense in T1-Gd when compared to T1; and the Peritumoral Edema (ED), which
is hyper-intense in the FLAIR sequence. These almost homogeneous sub-regions can
be clustered together to compose three semantically meaningful tumor classes:
the Enhancing Tumor (ET); the Tumor Core (TC), which is the union of ET, NET
and NCR; and the Whole Tumor (WT), which is obtained by adding ED to TC.
MRI sequences and the ground truth map with three classes are shown in Fig. 1.


Flair    T1CE    GT


Fig. 1. Visual analysis of BraTS 2021 training data. In the Ground Truth
(GT) mask, green, yellow and gray represent the peritumoral edema (ED), Enhancing
Tumor (ET) and non enhancing tumor/necrotic tumor (NET/NCR), respectively.
(Color figure online)


3.2 Problem Formulation 


Let $\mathcal{X} = \{(X_i, Y_i)\}_{i=1}^{m}$ be a labeled set with m number of
samples, where each sample $(X_i, Y_i)$ consists of an image
$X_i \in \mathbb{R}^{C \times D \times H \times W}$ and its associated
ground-truth segmentation mask $Y_i \in \{0,1,2,4\}^{C \times D \times H \times W}$.
Pixels with 0, 1, 2 and 4 in the label-map represent the background/air, Necrotic
(NCR) and Non-enhancing tumor core (NET), Peritumoral Edema (ED) and Enhancing
Tumor (ET), respectively.
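
To make the relation between these label values and the three evaluated classes
concrete, the sketch below derives binary ET, TC and WT masks from a label-map,
following the composition given in Sect. 3.1 (TC is ET plus NCR/NET, WT adds ED
to TC). The function name `brats_regions` and the use of a flat list of voxel
labels are assumptions made for this example.

```python
def brats_regions(labels):
    """Map BraTS label values {0, 1, 2, 4} to binary ET, TC and WT masks.

    labels: flat iterable of voxel labels
            (0 background, 1 NCR/NET, 2 ED, 4 ET).
    """
    labels = list(labels)
    return {
        "ET": [int(v == 4) for v in labels],          # enhancing tumor only
        "TC": [int(v in (1, 4)) for v in labels],     # ET + NCR/NET
        "WT": [int(v in (1, 2, 4)) for v in labels],  # TC + ED
    }


masks = brats_regions([0, 1, 2, 4])
print(masks["ET"])  # [0, 0, 0, 1]
print(masks["TC"])  # [0, 1, 0, 1]
print(masks["WT"])  # [0, 1, 1, 1]
```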


3.3 Network Architecture 


The proposed network architecture consists of three modules, namely a
segmentation network, a critic network and a Virtual Adversarial Training (VAT)
block. The segmentation network (i.e., $F(\cdot)$) is composed of down-sampling
and up-sampling layers with skip pathways, making it a U-like network
architecture [18]. The critic is constructed as a fully convolutional adversarial
network. Both networks consist of 3D convolutions. The critic constructively
pushes the segmentation network to predict segmentation masks that are more
similar to ground truth masks. The critic here follows the Markovian PatchGAN
architecture [9,10].


Reciprocal Adversarial Learning for Brain Tumor Segmentation 175 


In the original work, the Markovian PatchGAN architecture enables producing
confidence scores for prediction masks. Inspired by this, we adopt a similar
approach to provide uncertainty information to the segmentation network. The VAT
block generates adversarial examples, so that the segmentation network can learn
to avoid making such incorrect predictions on new patient data and patient data
with artefacts.

Fig. 2. Proposed overall network architecture. $F(\cdot)$ and $\psi(\cdot)$ denote
the Segmentation network and the Critic network. $X$, $Y$, $r_{adv}$ and $\hat{Y}$
are input data (original patient data), ground truth segmentation masks,
perturbation added on input data and the prediction generated from the
segmentation network. Here, the Critic discriminates between prediction masks and
the ground truth masks to perform the min-max game by generating
a pixel-wise confidence map. The VAT block improves the robustness of the model
against generated adversarial examples by adding perturbation that violates the
virtual adversarial direction.


3.4 Objective Function 


The parameters of the segmentation network are denoted by $\theta_G$ and those
of the critic network by $\theta_C$. To encourage the segmentation network to
yield predictions closer to the ground truth masks by deceiving the critic
network, we propose optimizing the following min-max problem:

$$\min_{\theta_G} \max_{\theta_C} \mathcal{L}(\theta; \mathcal{X}). \qquad (1)$$

We propose to train the segmentation network by minimizing a total loss
function which consists of three terms:

$$\mathcal{L}(\theta; \mathcal{X}) = \lambda_s \mathcal{L}_{dice}(\theta_G; \mathcal{X}) + \lambda_v \mathcal{L}_{vat}(\theta_G; \mathcal{X}; r_{adv}) + \lambda_c \mathcal{L}_{adv}(\theta_G; \theta_C; \mathcal{X}), \qquad (2)$$

where $\mathcal{L}_{dice}$, $\mathcal{L}_{vat}$, and $\mathcal{L}_{adv}$ denote
the supervised dice loss, the virtual adversarial training loss and the critic
loss, respectively. Furthermore, $\lambda_s, \lambda_v, \lambda_c > 0$
are hyper-parameters of the algorithm, controlling the contribution of each loss
term. It can be seen that the supervised dice loss and the VAT loss depend only
on the segmentation network, while the critic loss is defined based on the
parameters of the entire model. The segmentation network works robustly and
shows generalization performance as long as these hyper-parameters are defined
in a reasonable range. In our experiments we set $\lambda_s = 1.0$,
$\lambda_v = 0.2$ and $\lambda_c = 0.3$.

As the main loss, we use dice loss and we calculate dice loss for each class
(Multi-class loss function):

$$\mathcal{L}_{dice}(\theta_G; \mathcal{X}) = 1 - \mathbb{E}_{(X,Y)\sim\mathcal{X}} \left[ \frac{2\langle Y, \hat{Y} \rangle + \epsilon}{\|Y\|_1 + \|\hat{Y}\|_1 + \epsilon} \right], \qquad (3)$$

where we use $\langle A, B \rangle = \sum_{i,j,k} A[i,j,k]\,B[i,j,k]$,
$\|A\|_1 = \sum_{i,j,k} |A[i,j,k]|$, and $\epsilon$ is the smoothing factor
(set to 1 in our experiment).

VAT is an algorithm that updates the model by the weighted sum of the gradient
of the regularization term, which is the second loss term of our full objective
function. $\mathcal{L}_{vat}$ is a non-negative function that measures the
divergence between the ground truth distribution and the perturbed prediction
distribution. Inspired by the VAT method by Takeru et al. [15], we define the
divergence based Local Distributional Smoothness (LDS) as

$$\mathcal{L}_{vat}(\theta_G; \mathcal{X}; r_{adv}) = \mathbb{E}_{(X,Y)\sim\mathcal{X}} \left[ D_{KL}\big(Y \,\|\, F(\theta_G, X + r_{adv})\big) \right]. \qquad (4)$$

Minimizing $\mathcal{L}_{vat}$ improves the generalization performance of the
model and makes the model more robust against the adversarial examples that
violate the virtual adversarial direction. Instead of applying heavy data
augmentation to the dataset with images perturbed by regular deformation, we use
adversarial perturbation, which reduces the test error [19].
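
The loss terms above can be illustrated numerically with a minimal, framework-free
sketch. It operates on flat lists rather than tensors, omits the expectation over
samples, and assumes the standard factor of 2 in the Dice numerator; the function
names and the small constant `tiny` used to avoid a logarithm of zero are
assumptions of this example, not part of the described implementation.

```python
import math


def dice_loss(y_true, y_pred, eps=1.0):
    """1 - (2<Y, Y_hat> + eps) / (||Y||_1 + ||Y_hat||_1 + eps) for one sample."""
    inner = sum(t * p for t, p in zip(y_true, y_pred))
    l1 = sum(abs(t) for t in y_true) + sum(abs(p) for p in y_pred)
    return 1.0 - (2.0 * inner + eps) / (l1 + eps)


def kl_divergence(p, q, tiny=1e-12):
    """D_KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + tiny) / (qi + tiny))
               for pi, qi in zip(p, q) if pi > 0)


def total_loss(l_dice, l_vat, l_adv, lam_s=1.0, lam_v=0.2, lam_c=0.3):
    """Weighted sum of the three terms, with the weights reported in the text."""
    return lam_s * l_dice + lam_v * l_vat + lam_c * l_adv


print(dice_loss([1, 0, 1], [1, 0, 1]))   # 0.0 for a perfect prediction
print(kl_divergence([1.0, 0.0], [0.5, 0.5]))  # close to log(2)
```

A perfect prediction gives a Dice loss of zero, and the divergence term vanishes
when the perturbed prediction matches the reference distribution, so both terms
penalize only disagreement.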

We denote the functionality of the critic by
$\psi : [0,1]^{H \times W} \to [0,1]^{H \times W}$ and
define the normalized loss of the critic for the prediction distribution as:

$$\mathcal{L}_{adv}(\theta_G; \theta_C; \mathcal{X}) = -\mathbb{E}_{(X,Y)\sim\mathcal{X}} \Big\{ \sum_{a \in H} \sum_{b \in W} (1 - \eta) \log\big(\psi(\hat{Y})[a,b]\big) + \eta \log\big(1 - \psi(Y)[a,b]\big) \Big\}, \qquad (5)$$

where $\eta = 0$ if the sample is generated by the segmentation network, and
$\eta = 1$ if the sample is drawn from the ground truth labels. With this
adversarial loss, the segmentation network tries to deceive the critic by
generating predictions that are holistically more similar to ground truth masks.


4 Experiments


4.1 Implementation Details 


The proposed model is developed in PyTorch and trained from scratch. We use a
modified version of 3D UNet as the segmentation network and a 3D discriminator
as the critic network. In the 3D UNet, the contracting path comprises five layers,
including the bottleneck, each consisting of two 3 x 3 x 3 convolutions together
with group normalization and ReLU activation. The number of feature maps in
the first encoder is predefined as 48. The down-sampling layer consists of a max
pooling operation with a kernel size of 2 x 2 x 2 and stride 2. Blocks of the
expansive path perform up-sampling using trilinear interpolation followed by a
3 x 3 x 3 convolution. The final layer is a convolutional layer with a 1 x 1 x 1
kernel, 3 output channels and a sigmoid activation. Skip connections between the
contracting and expansive paths concatenate the corresponding outputs.
The 3D discriminator consists of four 3 x 3 x 3 convolutions with batch
normalization and leaky ReLU activation. The discriminator is implemented
following PatchGAN [9], where the cubic size is 1 x 1 x 1.


Image Pre-processing. Intensities of MRI volumes are inconsistent due to var- 
ious factors such as motions of patients during the examination, different man- 
ufacturers of acquisition devices, sequences and parameters used during image 
acquisition. To standardize all volumes, min-max scaling was performed followed 
by clipping intensity values. Images were then cropped to a fixed patch size of 
128 x 128 x 128 by removing unnecessary background pixels. 
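
The pre-processing steps can be sketched as follows. The text does not specify
the clipping bounds or how the crop window is chosen, so the bounds passed to
`clip` and the centered window returned by `center_crop_bounds` are assumptions
of this example, as are the function names; the sketch works on a flat list of
intensities for brevity.

```python
def min_max_scale(voxels):
    """Rescale intensities linearly to the range [0, 1]."""
    lo, hi = min(voxels), max(voxels)
    if hi == lo:                      # constant volume: avoid division by zero
        return [0.0 for _ in voxels]
    return [(v - lo) / (hi - lo) for v in voxels]


def clip(voxels, low, high):
    """Clamp intensities to [low, high] (bounds are an assumption here)."""
    return [min(max(v, low), high) for v in voxels]


def center_crop_bounds(size, patch):
    """Start and end index of a centered window of length `patch` along one axis."""
    start = max((size - patch) // 2, 0)
    return start, min(start + patch, size)


scaled = clip(min_max_scale([0, 5, 10]), 0.1, 0.9)
print(scaled)                        # [0.1, 0.5, 0.9]
print(center_crop_bounds(240, 128))  # (56, 184)
print(center_crop_bounds(155, 128))  # (13, 141)
```

Applied per axis to a 240 x 240 x 155 volume, such a window yields the
128 x 128 x 128 patch size mentioned above.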


Training. For training the segmentation network we use the Adam optimizer with
a learning rate of 2e-04, and for training the critic network we use the RMSProp
optimizer with a learning rate of 5e-05, as momentum based methods cause
instability [1]. The original training dataset was split into a training set (80%)
and a test set (20%), and the model was trained for 100 epochs with a batch size
of 2. Accordingly, 1000 MR volumes were used to train the model, while 251 MR
volumes were used as the test set.
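
A split of this size can be reproduced with a few lines of code. The sketch
below assumes a random shuffle with a fixed seed, which the text does not state;
the function name `split_cases` and the integer case identifiers are
illustrative.

```python
import random


def split_cases(case_ids, n_train, seed=0):
    """Shuffle the case identifiers and split them into train and test lists."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)   # fixed seed keeps the split reproducible
    return ids[:n_train], ids[n_train:]


train_ids, test_ids = split_cases(range(1251), n_train=1000)
print(len(train_ids), len(test_ids))  # 1000 251
```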


Inference. The BraTS 2021 validation dataset contains 219 MR volumes, and the
Synapse portal conducts the evaluation. In the inference phase, the original
volume is re-scaled using min-max scaling followed by clipping of intensity
values, and cropped to 240 x 240 x 155 before being fed to the saved 3D UNet model.


4.2 Performance Evaluation 


Segmentation accuracy of the three classes (i.e., ET, TC and WT) is evaluated
during training and inference. Both qualitative and quantitative analyses are
performed to evaluate the model accuracy.

Table 1. Validation Phase Results.


Class                | Hausdorff distance | Dice score | Sensitivity | Specificity
Enhanced Tumor (ET)  | 21.8296            | 81.3898    | 83.3949     | 99.9695
Tumor Core (TC)      | 8.5632             | 85.3856    | 85.0726     | 99.9745
Whole Tumor (WT)     | 5.3686             | 90.7654    | 92.0858     | 99.9107

Fig. 3. The box and whisker plots of the distribution of the segmentation metrics for
Validation Phase Results. The box-plot shows the minimum, lower quartile, median,
upper quartile and maximum for each tumor class. Outliers are shown away from the
lower quartile.

Evaluation Metrics. The learning model is evaluated using four metrics: (1)
Dice Sørensen coefficient (DSC), (2) Hausdorff Distance, (3) Sensitivity and (4)
Specificity.
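
For reference, sensitivity and specificity on binary masks can be computed as in
the following minimal sketch. It uses the standard definitions (true positive
rate and true negative rate); the function name and the flat-list mask
representation are assumptions of this example, and the official challenge
evaluation may differ in detail.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for two binary masks given as flat lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    # Undefined ratios (no positives or no negatives) are reported as NaN.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))  # (0.5, 1.0)
```

Because tumor voxels are a small fraction of a brain volume, specificity is
dominated by the background and tends to be close to 100%, as seen in the
reported results.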


Axial View Coronal View Sagittal View 


Fig. 4. Validation Phase Results for the Sample BraTS2021_00190. Here, green,
yellow and gray represent the Whole Tumor (WT), Enhancing Tumor (ET) and
Tumor Core (TC) classes, respectively. (Dice (ET) = 97.2585, Dice (TC) = 99.1492,
Dice (WT) = 97.5753) (Color figure online)


Validation Phase Experimental Results. The quantitative and qualitative
results of the proposed approach during the validation phase are shown in Table 1
and Figs. 3 and 4. It is noticeable that the proposed framework helps in
successfully identifying fine details in the predictions.


Testing Phase Experimental Results. Our final evaluation results on the 
testing dataset are shown in Table 2. Compared to validation phase results, it can 
be seen that average of Dice Similarity Scores for tumor sub regions is improved 
during testing phase. 
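
The comparison can be made explicit from the Dice scores reported in Tables 1
and 2. The sketch below only re-uses those reported values; the variable names
are illustrative.

```python
from statistics import mean

# Dice scores reported for the validation phase (Table 1) and testing phase (Table 2).
validation = {"ET": 81.3898, "TC": 85.3856, "WT": 90.7654}
testing = {"ET": 84.5530, "TC": 85.3010, "WT": 90.4583}

mean_validation = mean(validation.values())
mean_testing = mean(testing.values())
delta = {name: testing[name] - validation[name] for name in validation}

print(f"validation mean: {mean_validation:.2f}")  # 85.85
print(f"testing mean:    {mean_testing:.2f}")     # 86.77
for name, change in delta.items():
    print(f"{name}: {change:+.4f}")
```

The average rises by roughly 0.9 points, and the per-class differences show that
this gain comes from the enhancing tumor class, while the tumor core and whole
tumor scores are marginally lower on the test set.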


Table 2. Testing phase results.


Class                | Hausdorff distance | Dice score | Sensitivity | Specificity
Enhanced Tumor (ET)  | 13.4802            | 84.5530    | 88.0258     | 99.9680
Tumor Core (TC)      | 16.9814            | 85.3010    | 87.7660     | 99.9637
Whole Tumor (WT)     | 6.3239             | 90.4583    | 92.1467     | 99.9161


5 Conclusion


In this work, we demonstrate a simple and effective way to improve the training
of 3D U-Net by reciprocal adversarial learning. Our approach extends the VAT
method, making the segmentation network robust to adversarial perturbations
by generating adversarial examples, and adopts a min-max approach based on the
GAN architecture. Our experiments showed that virtual adversarial training and
uncertainty guidance help to improve the performance of the segmentation
network.


Abstract. Brain tumor segmentation by computer is still
an exciting challenge. UNet architecture has been widely used for medi- 
cal image segmentation with several modifications. Attention blocks have 
been used to modify skip connections on the UNet architecture and result 
in improved performance. In this study, we propose the development of 
UNet for brain tumor image segmentation by modifying its contraction 
and expansion block by adding Attention, adding multiple atrous con- 
volutions, and adding a residual pathway that we call Multiple Atrous 
convolutions Attention Block (MAAB). The expansion part is also added 
with the formation of pyramid features taken from each level to produce 
the final segmentation output. The architecture is trained using patches 
and batch 2 to save GPU memory usage. Online validation of the seg- 
mentation results from the BraTS 2021 validation dataset resulted in 
dice performance of 78.02, 80.73, and 89.07 for ET, TC, and WT. These 
results indicate that the proposed architecture is promising for further 
development. 


Keywords: Atrous convolution · Attention block · Pyramid features ·
Multiple atrous convolutions attention block · MAAB


1 Introduction 


Computational segmentation of brain tumors remains an exciting challenge.
Several events have been held to collect the latest methods with the best
segmentation performance. One event that continues to invite researchers to
innovate on segmentation methods is the Brain Tumor Segmentation Challenge
(BraTS Challenge), which has been held every year from 2012 through
2021 [4].

The BraTS 2021 challenge provides a larger dataset than the previous year. The
dataset currently consists of 1251 labeled training cases and 219 unlabeled
validation cases. Segmentations of the validation data can be checked for
correctness using the online validation tool provided on the
https://www.synapse.org site [5-7, 12].

Among the many current architectures, UNet has become a widely used model for
medical image segmentation. First used to segment neuronal structures in EM
stacks [14], the architecture has since been developed for segmenting 3D
medical images. Developments of UNet include modifying the blocks at each
level in both the encoder and decoder parts, modifying the skip connections,
and adding links in the decoder section to form pyramid features.

One development of the UNet architecture modifies the skip connections by
adding an attention gate, which is intended to focus on the target
segmentation object. The attention gate is trained to minimize the influence
of the less relevant parts of the input image while still focusing on the
features essential to the segmentation target [15].

Another line of UNet development modifies the blocks themselves, as in [1],
which creates two paths in one block. One path uses a convolution with kernel
size 5 × 5 followed by normalization and ReLU. The other path uses a
convolution with kernel size 3 × 3 followed by residual blocks. The outputs of
the two paths are merged by concatenating their output features. Others modify
the UNet block by using atrous convolution to obtain a wider receptive
field [17].

Merging the feature maps output by each level of the UNet decoder to form a
feature pyramid has also been used to improve segmentation performance, as in
[13]. The feature pyramid was inspired by [10], where it was used for object
detection. Pyramid features have also been used in several studies to segment
brain tumors [18,21, 22].

In this study, we propose a modification of the UNet architecture for brain
tumor segmentation from 3D MRI images. The modifications include equipping
each block with multiple atrous convolutions and adding an attention gate
accompanied by a residual path to accelerate the convergence of the model. The
skip connection portion of UNet is modified by adding an attention gate
connected to the output of the lower expansion block. Finally, pyramid
features are formed by combining the feature outputs from each level in the
expansion section, which are connected to a convolution block to produce the
segmentation output. The segmentation performance obtained is promising.


2 Methods 


2.1 Dataset 


The datasets used in this study are the BraTS 2021 training dataset and the
BraTS 2021 validation dataset. Each dataset was acquired with different
clinical protocols and different MRI scanners at multiple contributing
institutions. The BraTS 2021 training dataset contains data from 1251 patients
with four modalities (T1, T1Gd, T2, and T2-Flair), accompanied by one
associated segmentation label. There are four label values: 1 indicates
necrosis/non-enhancing tumor, 2 indicates edema, 4 indicates enhancing tumor,
and 0 indicates non-tumor and background. The labels were annotated by one to
four annotators and were checked and approved by expert neuro-radiologists.
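
As a minimal sketch of how these label values relate to the three evaluated regions, the snippet below builds ET, TC, and WT masks from a label map. It follows the region definitions given in the loss-function section of this paper (TC combines enhancing tumor and necrosis; WT combines all tumor labels). The function name and the channel order are assumptions made for illustration.

```python
import numpy as np

def labels_to_regions(label_map):
    """Convert a BraTS label map (values 0, 1, 2, 4) into ET/TC/WT masks.

    Returns an array of shape (3, ...) with channels ordered ET, TC, WT.
    """
    label_map = np.asarray(label_map)
    et = label_map == 4          # enhancing tumor
    tc = et | (label_map == 1)   # tumor core = enhancing tumor + necrosis
    wt = tc | (label_map == 2)   # whole tumor = tumor core + edema
    return np.stack([et, tc, wt]).astype(np.uint8)
```

The three masks are nested (ET inside TC inside WT), so a voxel can belong to more than one output channel.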

The BraTS 2021 validation dataset, on the other hand, does not come with
labels. The segmentation results must be submitted to the provided online
validation site¹ to assess their correctness. The BraTS 2021 validation
dataset contains data from 219 patients with the same four modalities as the
BraTS 2021 training dataset.


2.2 Preprocessing 


The 3D images of the BraTS 2021 training dataset and the BraTS 2021 validation
dataset were obtained from a number of different scanners and multiple
contributing institutions, so the range of voxel intensities differs from
image to image. These values therefore need to be normalized to a common
range. Each 3D image was normalized using Eq. 1, similar to [2].


$$I_{norm} = \frac{I_{orig} - \mu}{\sigma} \qquad (1)$$


where $I_{norm}$ and $I_{orig}$ are the normalized image and the original image,
while $\mu$ and $\sigma$ are the mean and standard deviation of all non-zero
voxels in the 3D image. The normalization was carried out for each patient and
each modality, both for the BraTS 2021 training dataset during training and
the BraTS 2021 validation dataset during inference.
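
The normalization of Eq. 1 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code: the function name, the small `eps` guard against a zero standard deviation, and the choice to apply the formula to every voxel (rather than leaving background voxels at zero) are assumptions.

```python
import numpy as np

def normalize_nonzero(volume, eps=1e-8):
    """Z-score a 3D volume using statistics of its non-zero voxels (Eq. 1)."""
    volume = np.asarray(volume, dtype=np.float32)
    mask = volume != 0
    if not mask.any():
        return volume.copy()      # nothing to normalize in an empty volume
    mu = volume[mask].mean()      # mean of non-zero voxels
    sigma = volume[mask].std()    # standard deviation of non-zero voxels
    return (volume - mu) / (sigma + eps)  # eps is an assumed safeguard
```

In use, the function would be called once per patient and per modality, so that each of the four input channels ends up with comparable intensity statistics.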


2.3 Proposed Architecture 


The architecture proposed in this study extends the UNet architecture to 3D
image processing. The proposed architecture is shown in Fig. 1.

All modalities are used as input in this study, followed by a dropout layer
for regularization, as proposed in [16]. Dropout has also been used for
regularization in several studies, with a rate between 0.1 and 0.5
[3,8,9,11,19,20]. In this paper, the dropout rate is 0.2, and the dropout
layer is placed at the beginning of the network.

The next layer is the Multi Atrous Attention Block (MAAB). This block comes in
four levels: 1, 2, 3, and 4. The internal structure of the block is shown in
Fig. 2.


1 https://www.synapse.org. 
Fig. 1. Unet3D with multiple atrous convolution attention block

Fig. 2. Multiple Atrous Attention Block - MAAB


The MAAB block processes feature maps with atrous convolutions whose dilation
factors depend on the block level. Atrous convolution expands the receptive
field of the feature map without increasing the number of parameters that must
be learned. The deeper the downsampling level, the higher the level of the
MAAB block, which enlarges the receptive field that can be covered and
improves the ability of the architecture to learn from the feature maps.
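
The claim that dilation widens the receptive field at no parameter cost can be checked with a short calculation. The sketch below assumes a kernel size of 3, which the text does not state, and the helper names are hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)

def conv3d_weight_count(kernel_size, in_channels, out_channels):
    """Number of weights in a 3D convolution; independent of the dilation."""
    return kernel_size ** 3 * in_channels * out_channels

# Extent along one axis for the dilation factors used in MAAB (1, 2, 4, 8),
# under the assumed kernel size of 3.
extents = [effective_kernel_size(3, d) for d in (1, 2, 4, 8)]
```

Under these assumptions the kernel extent grows from 3 to 17 voxels per axis as the dilation factor goes from 1 to 8, while the weight count of each layer stays the same.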
In the first level, the MAAB block contains one convolution layer with a
pre-activation strategy. The second level contains the first-level layer plus
one atrous convolution layer with a dilation factor of 2. Each following level
contains the layers of the previous level plus one more atrous convolution
layer, so that the dilation factors of the convolution layers are, in order,
1, 2, 4, and 8. The residual path connects the convolution result at the
beginning of the block to the combined output of the levels used in this MAAB
block by feature addition. At the end of the block, an attention sub-block is
added to keep the focus on relevant features.

The skip connection is modified by adding an attention block before it is
connected to the expansion-section feature. This attention block keeps the
model focused on relevant features, as proposed in [15]. The attention diagram
used in this study is shown in Fig. 3. G in the figure is a feature that comes
from the expansion level before being upsampled, while X is a feature of the
skip connection of the contraction section. The output of this attention block
is combined with the upsampled feature at the equivalent level for subsequent
processing.


Fig. 3. Attention block diagram (recoverable block labels: GN, ReLU, CV, Sigmoid)
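
An additive attention gate in the spirit of [15] can be sketched as follows. This is a simplified NumPy illustration, not the block from Fig. 3: the normalization step (GN) is omitted, the gating feature G is brought to the resolution of X by nearest-neighbour upsampling, and the weight matrices `w_x`, `w_g`, and `psi` stand in for 1 × 1 × 1 convolutions. All of these choices and names are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pointwise_conv(features, weights):
    """1x1x1 convolution: features (D, H, W, C_in) times weights (C_in, C_out)."""
    return features @ weights

def upsample_nearest(features, factor):
    """Nearest-neighbour upsampling along the three spatial axes."""
    for axis in range(3):
        features = np.repeat(features, factor, axis=axis)
    return features

def attention_gate(x, g, w_x, w_g, psi):
    """Scale skip feature x by attention coefficients computed from x and g."""
    g_up = upsample_nearest(g, x.shape[0] // g.shape[0])
    combined = np.maximum(pointwise_conv(x, w_x) + pointwise_conv(g_up, w_g), 0.0)  # ReLU
    alpha = sigmoid(pointwise_conv(combined, psi))  # shape (D, H, W, 1), values in (0, 1)
    return x * alpha
```

Since the coefficients lie strictly between 0 and 1, the gate can only attenuate skip features, which is how it suppresses regions that the coarser decoder feature marks as irrelevant.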


In the expanding section, the feature maps at each level are concatenated
before being passed into the last MAAB level 1 block. The feature map at the
lowest level is upsampled by a factor of four, while the second level is
upsampled by a factor of two, to match the size of the feature map at level
one. This connection forms the pyramid feature map and provides supervision
for each lower level. The output of the last MAAB block is convolved into
three channels representing the segmentation targets (ET, WT, and TC).
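
The pyramid-feature merge described above can be sketched as an upsample-and-concatenate step. The sketch assumes channels-last feature maps and nearest-neighbour upsampling; the text does not say which interpolation is used, and the function names are hypothetical.

```python
import numpy as np

def upsample_nearest(features, factor):
    """Nearest-neighbour upsampling along the three spatial axes."""
    for axis in range(3):
        features = np.repeat(features, factor, axis=axis)
    return features

def pyramid_concat(level1, level2, level3):
    """Bring decoder features to level-1 resolution and concatenate channels.

    level2 is upsampled by 2 and level3 (the lowest level) by 4.
    """
    return np.concatenate(
        [level1, upsample_nearest(level2, 2), upsample_nearest(level3, 4)],
        axis=-1,
    )
```

The concatenated tensor would then be the input to the final MAAB level 1 block, so the output convolution sees features from every decoder scale.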
2.4 Loss Function


The loss function used during training is the Dice loss, given in Eq. 2. Three
types of objects are detected in the image: Enhanced Tumor; Tumor Core, which
is the combination of Enhanced Tumor and Necrotic objects; and Whole Tumor,
which is the combination of all tumor objects. The overall loss therefore
combines the three areas with the weights given in Eq. 3.


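$$\mathrm{dloss}_{obj}(P, Y) = 1 - \frac{2 \times P_{obj} \times Y_{obj} + \epsilon}{|P_{obj}| + |Y_{obj}| + \epsilon} \qquad (2)$$


$$\mathrm{Loss} = 0.34 \times \mathrm{dloss}_{ET} + 0.33 \times \mathrm{dloss}_{TC} + 0.33 \times \mathrm{dloss}_{WT} \qquad (3)$$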


where P represents the predicted result, Y represents the segmentation target,
and $\epsilon$ is a small value that avoids division by zero. ET, TC, and WT
denote the Enhanced Tumor, Tumor Core, and Whole Tumor areas.
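
Eqs. 2 and 3 can be written out as a short NumPy sketch. It interprets |P| and |Y| as the sums of the voxel values and the product P × Y as the sum of the element-wise product, which is a common reading of the Dice loss but an assumption here; the value of `eps` and the function names are also assumptions.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-5):
    """Dice loss for one region (Eq. 2): 1 - (2*|P.Y| + eps) / (|P| + |Y| + eps)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    intersection = np.sum(pred * target)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)

def combined_loss(pred, target, weights=(0.34, 0.33, 0.33)):
    """Weighted sum of region losses (Eq. 3); last axis holds ET, TC, WT."""
    return sum(
        w * dice_loss(pred[..., c], target[..., c])
        for c, w in enumerate(weights)
    )
```

A perfect prediction gives a loss of 0, while a prediction with no overlap gives a loss close to 1. This matches the later remark in the Results that a patch containing no tumor but a non-empty prediction pushes the loss toward 1.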


2.5 Experiment Settings 


The hardware used in this study includes an Nvidia RTX 2080 Ti 11 GB GPU,
64 GB of RAM, and a Core i7 processor. The deep learning framework used is
TensorFlow/Keras version 2.5.

The training was carried out using the BraTS 2021 training dataset, which 
contained 1251 patient data with four modalities (T1, T1Gd, T2, T2-Flair) and 
one ground-truth file for each patient. The data is split into two parts, with 
80% as training data and 20% as local validation data. To minimize variation in 
training, a 5-fold cross-validation strategy is used. 
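
The 80/20 split with 5-fold cross-validation can be sketched as below. The shuffling seed and the function name are assumptions; the text does not say how cases were assigned to folds.

```python
import numpy as np

def five_fold_splits(n_cases=1251, n_folds=5, seed=0):
    """Return (train_idx, val_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)      # fixed seed is an assumption
    order = rng.permutation(n_cases)
    folds = np.array_split(order, n_folds)  # fold sizes differ by at most one
    splits = []
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        splits.append((train_idx, val_idx))
    return splits
```

With 1251 cases, four of the five folds hold 250 validation cases and 1001 training cases, which matches the counts quoted in the Results section.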

The model was trained using the Adam optimizer with a learning rate of 1e-4
for 300 epochs per fold. The data augmentation techniques used include random
cropping, random permutation of the three axes, random replacement of a
channel with values drawn from a Gaussian distribution, and random mirroring
along each axis.

The model is trained on patches of size 72 × 72 × 72 with a batch size of 2 to
limit GPU memory requirements. The 3D image patches are taken at random from
the area containing the tumor. During inference, the data is also processed in
patches of size 72 × 72 × 72, with a shift of 64 voxels along each axis.
Voxels from the overlapping segmentation results are averaged to obtain the
final segmentation result.
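
The overlapping-patch inference can be sketched as a sliding window that accumulates predictions and divides by the number of windows covering each voxel. This is an illustrative sketch: `predict_fn` is a hypothetical stand-in for the trained model, and clamping the last window to the volume border is an assumption, since the text does not describe how borders are handled.

```python
import numpy as np

def window_starts(length, patch, stride):
    """Start offsets along one axis; the last window is clamped to the border."""
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)
    return starts

def sliding_window_predict(volume, predict_fn, patch=72, stride=64):
    """Average overlapping patch predictions over a (D, H, W, C) volume.

    Assumes every spatial dimension is at least `patch` voxels long.
    """
    d, h, w = volume.shape[:3]
    prob_sum = None
    counts = np.zeros((d, h, w, 1), dtype=np.float32)
    for z in window_starts(d, patch, stride):
        for y in window_starts(h, patch, stride):
            for x in window_starts(w, patch, stride):
                sl = (slice(z, z + patch), slice(y, y + patch), slice(x, x + patch))
                out = predict_fn(volume[sl])  # (patch, patch, patch, K)
                if prob_sum is None:
                    prob_sum = np.zeros((d, h, w, out.shape[-1]), dtype=np.float32)
                prob_sum[sl] += out
                counts[sl] += 1.0
    return prob_sum / counts
```

For a 240-voxel axis, a 72-voxel patch and a 64-voxel shift give window starts at 0, 64, 128, and 168 under the border-clamping assumption, so consecutive windows overlap by at least 8 voxels.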


3 Results 


The time required for training and inference with the five-fold strategy is
shown in Table 1. The average training time per fold for 300 epochs is
104408 s, or 348.027 s per epoch. This time covers training on 1001 cases and
local validation on 250 cases. The average inference time per fold is 1530 s,
as shown in Table 1. This time is used to segment the 219 validation cases, so
each case takes 6.99 s on average. With the ensemble of 5 models, inference
takes 10054 s, an average of 45.91 s per case.


Table 1. Model training time on 300 epochs


Fold    | Training time (s) | Inference time (s)
Fold 1  | 104172            | 1567
Fold 2  | 104258            | 1522
Fold 3  | 104159            | 1514
Fold 4  | 104652            | 1516
Fold 5  | 104799            | 1531
Average | 104408            | 1530
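
The averages and per-case times quoted in the text follow directly from the table, as the short calculation below shows. The variable names are chosen for illustration; the figures are those reported in this section.

```python
train_times = [104172, 104258, 104159, 104652, 104799]  # seconds per fold
infer_times = [1567, 1522, 1514, 1516, 1531]            # seconds per fold
epochs = 300
n_cases = 219          # validation cases segmented at inference
ensemble_time = 10054  # seconds for the 5-model ensemble, as reported

mean_train = sum(train_times) / len(train_times)
mean_infer = sum(infer_times) / len(infer_times)
per_epoch = mean_train / epochs
per_case_single = mean_infer / n_cases
per_case_ensemble = ensemble_time / n_cases

print(f"mean training time: {mean_train:.0f} s, per epoch: {per_epoch:.3f} s")
print(f"per case: {per_case_single:.2f} s (single), {per_case_ensemble:.2f} s (ensemble)")
```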


The loss obtained during training for each fold is shown in Fig. 4. The most
stable folds are the 3rd and the 5th, which show no spikes in the graph. The
other folds show spikes at certain times. In the 1st fold, for example, there
is a spike between epochs 50 and 100 in both the training and validation loss;
the 2nd and 4th folds behave similarly. A possible cause is that training uses
random patches: a randomly selected patch may contain no object while the
model still detects one, so that the loss value approaches 1.

From Fig. 4(f), it can be seen that the overall training of this model
converges. The spikes do not exceed the initial loss value. At the end of
training, the training and validation loss values also converge. All graphs
(a-e) show a similar convergence pattern. The validation loss is also not much
different from the training loss, which suggests that the model is not
overfitting.

The Dice scores during training are consistent with the loss values, given
that the loss function is 1 − Dice. However, because three objects are
counted, the loss value combines the Dice scores of each object with the
weights given in Eq. 3. The average Dice value of each object during training
over all folds is shown in Fig. 5. The validation scores for the ET and TC
objects follow a good pattern, increasingly outperforming the training scores
near the end of training. In contrast, the validation score for the WT object
is always below the WT training score. Nevertheless, the score of each object
increases until the end of training.
Fig. 4. Loss value during training for each fold. (a)-(e) Training and validation loss in
the first fold to the fifth fold. (f) Average training and validation loss on 5-fold cross
validation

Fig. 5. Average dice score on 5-fold cross validation training: (a) Average dice score
for ET Object, (b) Average dice score for TC Object, (c) Average dice score for WT
Object.


Online validation results for the models of the first to the fifth fold are
displayed in Table 2. The table also shows the result of ensembling the five
trained models by averaging.
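
Ensembling by averaging can be sketched as taking the voxel-wise mean of the probability maps from the five fold models and then thresholding. The threshold of 0.5 and the function name are assumptions; the text states only that the models are combined with the average method.

```python
import numpy as np

def ensemble_average(prob_maps, threshold=0.5):
    """Average per-model probability maps and threshold the mean.

    prob_maps: sequence of arrays with identical shape, one per model.
    Returns the mean probability map and the binary segmentation.
    """
    mean_prob = np.mean(np.stack(prob_maps, axis=0), axis=0)
    return mean_prob, (mean_prob >= threshold).astype(np.uint8)
```

Averaging probabilities before thresholding lets a confident model outweigh several uncertain ones, which a majority vote over binary masks would not do.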


Table 2. Online validation result on BraTS 2021 validation dataset


         | Dice (%)              | Sensitivity (%)       | Specificity (%)       | Hausdorff95
Model    | ET    | TC    | WT    | ET    | TC    | WT    | ET    | TC    | WT    | ET    | TC    | WT
FOLD1    | 75.82 | 79.51 | 88.72 | 73.42 | 76.53 | 90.19 | 99.98 | 99.98 | 99.90 | 25.53 | 17.36 | 7.35
FOLD2    | 73.85 | 79.76 | 87.47 | 77.91 | 82.21 | 91.17 | 99.96 | 99.95 | 99.86 | 38.11 | 19.84 | 14.46
FOLD3    | 75.46 | 79.69 | 86.89 | 80.75 | 81.74 | 91.57 | 99.96 | 99.96 | 99.85 | 30.98 | 20.30 | 18.86
FOLD4    | 74.74 | 77.32 | 85.56 | 76.73 | 76.47 | 92.09 | 99.97 | 99.97 | 99.81 | 32.91 | 18.59 | 20.35
FOLD5    | 76.48 | 74.72 | 87.70 | 80.47 | 76.45 | 91.34 | 99.96 | 99.97 | 99.87 | 28.41 | 28.97 | 12.10
ENSEMBLE | 78.02 | 80.73 | 89.07 | 80.51 | 80.55 | 92.34 | 99.97 | 99.97 | 99.88 | 25.82 | 21.17 | 11.78


This architecture was also tested on the BraTS 2021 testing dataset for the
challenge. The ground truth for this dataset is not provided. We submitted
only the code that builds the architecture and segments one patient's data at
a time, along with the model weight files, in Docker format. We used five
models ensembled into one with the same averaging method as the ensemble model
in Table 2. The performance of the 5-model ensemble on the BraTS 2021 testing
dataset is promising, as shown in Table 3.
Table 3. Online result on BraTS 2021 testing dataset


           | Dice (%)              | Sensitivity (%)       | Specificity (%)       | Hausdorff95
Model      | ET    | TC    | WT    | ET    | TC    | WT    | ET    | TC    | WT    | ET    | TC    | WT
Mean       | 81.68 | 82.92 | 88.42 | 84.82 | 85.34 | 92.29 | 99.97 | 99.96 | 99.89 | 19.70 | 23.01 | 10.70
StdDev     | 22.30 | 25.52 | 13.29 | 22.50 | 24.45 | 9.87  | 0.05  | 0.07  | 0.15  | 70.71 | 73.63 | 18.54
Median     | 89.57 | 93.10 | 92.72 | 93.09 | 95.20 | 95.74 | 99.98 | 99.98 | 99.93 | 1.73  | 2.45  | 3.61
25quantile | 79.84 | 83.86 | 88.13 | 83.51 | 85.34 | 90.66 | 99.96 | 99.97 | 99.88 | 1.00  | 1.00  | 1.73
75quantile | 94.09 | 96.54 | 95.55 | 97.05 | 98.28 | 98.04 | 99.99 | 99.99 | 99.96 | 3.61  | 7.25  | 9.10


4 Discussion 


In this study, we propose a modified Unet3D architecture for brain tumor
segmentation. Each block is modified with atrous convolutions, an attention
gate, and a residual path. The skip connection section is modified by adding
an attention gate that combines the features of the contraction section with
those of the expansion section one level below the equivalent level. Pyramid
features are also added to improve segmentation performance. Evaluating the
combination of 5 models on the validation dataset resulted in Dice scores of
78.02, 80.73, and 89.07 for the ET, TC, and WT objects.

In Fig. 4, especially in parts (a), (b), and (d), there are spikes in the loss
value at certain epochs. The suspected cause is that random patch selection
can produce a volume that contains no object (ET, TC, or WT) while the model
still predicts one, causing the loss value to spike suddenly. However, the
exact cause needs further investigation.
Abstract. Since 2012 the BraTS competition has become a benchmark
for brain MRI segmentation. The top-ranked solutions from the com- 
petition leaderboard of past years are primarily heavy and sophisticated 
ensembles of deep neural networks. The complexity of the proposed solu- 
tions can restrict their clinical use due to the long execution time and 
complicate the model transfer to the other datasets, especially with the 
lack of some MRI sequences in multimodal input. The current paper pro- 
vides a baseline segmentation accuracy for each separate MRI modality 
and all four sequences (T1, T1c, T2, and FLAIR) on conventional 3D 
U-net architecture. We explore the predictive ability of each modality to 
segment enhancing core, tumor core, and whole tumor. We then com- 
pare the baseline performance with BraTS 2019-2020 state-of-the-art 
solutions. Finally, we share the code and trained weights to facilitate 
further research on model transfer to different domains and use in other 
applications. 


Keywords: brain MRI · Medical segmentation · U-Net · BRATS2021


1 Introduction 


1.1 MRI-Based Models for Brain Tumor Segmentation 


Following the success of computer vision-based detection systems in
mammography [2] and pulmonology [3], the application of deep learning (DL)
models to brain MRI has been extensively studied [1]. The emergence of DL
solutions that outperform the standard first read of a medical image has
become possible for several reasons: progress in hardware and software for
computer vision, improvements in data management and sharing policies, and,
most importantly, the availability of massive human-labeled databases.

Brain MR imaging has several peculiarities: collected data are predominantly
three-dimensional, serial or multimodal, and domain (scanner) specific. The
neuropathological cases investigated are rare, and data gathering and labeling
are expensive, which leads to smaller sample sizes for research. Thus, a
typical sample for DL training is limited to hundreds of three-dimensional
images, which compromises the training of domain-stable solutions [4].

BraTS is the most extensive open-source collection of labeled brain MR 
images, which makes the dataset of the most interest in developing state-of-the- 
art DL solutions in neuroradiology. BraTS2021 collection includes more than a 
thousand annotated cases for supervised DL model training and, notably, for 
transfer learning to other brain diagnostics datasets and pathologies [5]. 


1.2 MRI Modalities in Brain Tumor Segmentation 


In diagnosing brain tumors, magnetic resonance imaging (MRI) is widespread
or even ubiquitous. The great diversity of imaging modalities makes it
possible to explore and highlight the different tissue contrasts and unique
details related to each part of the tumor. The most informative modalities,
which are also the ones included in brain cancer treatment protocols, are
T1-weighted (T1), T2-weighted (T2), T1-weighted with gadolinium contrast
enhancement (T1-Gd or T1c), and T2 Fluid Attenuated Inversion Recovery
(FLAIR). Each of them, owing to its characteristics, emphasizes different
features [11]. T1 is good at distinguishing healthy tissue from malignant
regions; T2, with its bright signal, highlights areas of edema; T1-Gd is more
suitable for defining tumor boundaries; and FLAIR MR images are used to
differentiate edema from cerebrospinal fluid (CSF).

Yet different protocols for clinical brain tumor imaging can vary from hospital 
to hospital and include other biology-driven MRI methods for surgical and radio- 
surgical planning and assessment of treatment response. These MR modalities 
can include diffusion-tensor imaging (DTI), perfusion-weighted imaging (PWI), 
susceptibility-weighted imaging (SWI) [6] and others. Thus, there is a need to 
independently explore each MR sequence’s predictive ability and build solutions 
without fixing the input set of modalities [12]. 


1.3 Architectures for MRI Segmentation 


First DL Approaches for Image Segmentation and SOTA Solutions. 
First DL architectures for semantic segmentation appeared in 2015 with fully 
convolutional networks (FCN) [7]. Then convolutional encoder-decoder architec- 
tures of SegNet [8] and U-net [9] showed drastically better performance than just 
bilinear interpolation of the last layers in FCN. 

Today, U-nets, proposed in 2016, are still considered conventional for
medical image segmentation tasks [10]. The original architecture has undergone
modifications: for example, it has gained 3D convolutions (3D U-net) and
residual connections (Residual U-net) and has incorporated DenseNet blocks
(Dense U-net). The nn-Unet architecture [24], proposed in 2020, is recognized
as the benchmark for medical image segmentation.

The flexible architecture of the nn-Unet allows the addition of extra blocks
to construct a deeper network, additional channels to train on multimodal
input, and training on image parts (patches) for memory optimization. With
data augmentation and model fine-tuning, the architecture performs well with
less memory consumption than when trained on full-sized images. Therefore, a
significant portion of today’s MRI image segmentation solutions is based on
modifications of 3D U-net [13] and their ensembles.

It is worth noting that a U-shaped DL architecture with up and down
convolutions is not the only solution for medical image segmentation. Recently
proposed algorithms for brain tumor segmentation are based on an engineering
approach of MR image processing, thresholding, and binarization, which does
not require extensive DL training [11]. Conversely, deeper and heavier network
architectures can outperform U-nets on distinct tasks. The recent U2-net
architecture [14] showed better background separation on 2D images and
exhibits high potential to compete with the original ensemble architectures.
Recently proposed transformer architectures incorporated into U-net have been
shown to outperform conventional ones on a small sample of abdominal images
[16]. There are also adversarial U-nets with a GAN-based structure; with more
training, SeGAN [15] outperforms conventional architectures.

Although the architectures above can provide more accurate predictions, at
the moment they are harder to train, fine-tune, and transfer to other domains,
and such transfer is an especially important instrument in medical image
analysis [17,18]. Thus, in the current paper we aim to:


1. perform an ablation study to find optimal 3D U-net training setting for better 
convergence of lightweight model on BRATS2021 data; 

2. explore the predictive ability of separate MR modalities for brain tumor seg- 
mentation; 

3. share the trained weights to facilitate further research on transfer to other 
datasets and contrast-agnostic solutions. 


2 Experiments 


We chose the experiment design to find the most lightweight U-net architecture
and training schema that achieve reasonable segmentation quality on the
BraTS2021 training sample. We performed an ablation study on each separate
modality, as well as on multiple-sequence input, so that the shared weights
can be further used for transfer learning or pre-training on different
modalities and their combinations.


2.1 Dataset 


The Multi-modal Brain Tumor Segmentation Challenge (BraTS) 2021¹ dataset for
the segmentation task consists of multi-parametric MRI (mpMRI) scans of
glioma. Segmentation labels include the glioma sub-regions: the “enhancing
tumor” (ET), the “tumor core” (TC), and the “whole tumor” (WT). The MRI scans
for each subject are provided in multiple sequences: native (T1),
post-contrast T1-weighted (T1c), T2-weighted (T2), and T2 Fluid Attenuated
Inversion Recovery (FLAIR) volumes.


1 http://braintumorsegmentation.org/.

The CaPTk data preprocessing pipeline includes co-registration, interpolation
to 1 mm³ isotropic resolution with an image size of 240 × 240 × 155, and
skull-stripping [25, 26].


2.2 Baseline U-net Model 


Data Preprocessing and Augmentation. The overall pipeline was written
with the torchio² package [20], which provides a built-in U-net and patch
creation, as well as data augmentations applied with varying probability.
Prior to training, HistogramStandardization and ZNormalization were applied
to the whole training sample to standardize the foreground histogram and
obtain zero mean and unit variance.

We explored two variants of data augmentation during training, where p denotes
the probability of applying a transform and n the number of artifacts
produced:


1. Restricted - includes RandomAnisotropy, p=0.25; RandomBlur, p=0.25;
RandomNoise, p=0.25 (Gaussian noise); RandomBiasField, p=0.3 (to simulate
magnetic field inhomogeneity);

2. Extensive - includes the Restricted augmentations plus [RandomAffine, p=0.8;
RandomElasticDeformation, p=0.2] (the probability of 0.8 is set for the pair
of transformations) and [RandomMotion, n=1; RandomSpike, n=2;
RandomGhosting, n=2] (the probability of 0.5 is set for the three augmentations).


Training with extensive augmentations doubled the convergence time with no 
significant quality increase, thus all experiments with extensive augmentations 
were excluded from the results. 

On model architecture, we explored several U-net modifications, extending 
the depth and width of the network and changing the normalisation and upsam- 
pling: 


1. Model 1: with 3 encoding blocks and 4 out channels for first layer, patch size
64, batch normalization and ReLU activation function, linear upsampling;

2. Model 2: with 3 encoding blocks and 4 out channels for first layer, patch size
128, batch normalization and ReLU activation function, linear upsampling;

3. Model 3: with 5 encoding blocks and 4 out channels for first layer, patch size
128, batch normalization, PReLU activation function, linear upsampling;

4. Model 4: with 5 encoding blocks and 16 out channels for first layer, patch
size 128, instance normalization and Leaky ReLU activation function, linear
upsampling;

5. Model 5: with 5 encoding blocks and 4 out channels for first layer, patch size
128, batch normalization, PReLU activation function, trilinear upsampling
and preactivation;

6. Model 6: with 5 encoding blocks and 16 out channels for first layer, patch size
128, batch normalization, PReLU activation function, trilinear upsampling
and preactivation.


2 https://torchio.readthedocs.io/.


Batch size was adjusted to GPU capacity. Due to the large input image size,
the input patch size is 64 × 64 × 64 for the first experiments and
128 × 128 × 128 for deeper U-nets, and the batch size equals 32 and 16 for the
training and validation sets, respectively. It is worth mentioning that for
the deeper U-net architecture, even the unimodal-input model does not fit into
one GPU and was parallelized across two GPUs, even with a drastically reduced
batch size (4 for training and 2 for validation).

The ablation study covered the choice of optimizer, loss, and augmentations
for uni- and multi-modality image input. As optimizers, we used AdamW with
default parameters and Adam with a learning rate of 1e-3 and weight decay of
1e-4. We also used the stochastic gradient descent (SGD) optimizer with an
initial learning rate of 0.01 and momentum of 0.9 with weight decay.


2.3 Comparison with BraTS Toolkit Solutions 


To compare the baseline U-net performance with state-of-the-art (SOTA)
networks, we chose the two latest solutions implemented in BraTS Toolkit³ [21].
BraTS Toolkit provides software for brain tumor segmentation; it incorporates
state-of-the-art solutions from past years’ BraTS competitions in stable
executable versions (Docker containers), together with a fusion of their
predictions.

The latest uploads in BraTS Toolkit are scan-2019 and scan lite-20, which
implement the solution from the paper Triplanar Ensemble of 3D-to-2D CNNs
with Label-Uncertainty for Brain Tumor Segmentation [23]. Additionally, we
compare these results to the containerized solution xyz 2019, an
implementation of a U-net based self-ensemble network [22].

The prediction (inference) time for one subject on a GPU does not exceed 5 min
for scan lite-20 and 20 min for xyz 2019 and scan-2019. It is assumed that
data preprocessing is identical to the previous years’ data, and therefore the
models could be applied directly. It is worth noting that these solutions were
trained on previous years’ BraTS data, so scoring on BraTS 2021 training data
can be compromised by data leakage. Thus, the BraTS Toolkit solutions were
scored on a blind validation set, assuming no data leak from previous years.


2.4 Experimental Settings 


All experiments were implemented in PyTorch and trained on two NVIDIA
Tesla P100 PCIe 16 GB GPUs. The relatively large size of the BraTS 2021
training sample allowed us to compare models on a single train/test split with
a ratio of 0.7 (the same split for all models). The number of subjects in
train/test equals 939/312, respectively (both the train and test sets are
drawn from the competition training sample).
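
A fixed, reproducible subject-level split of this kind can be sketched as
follows. The held-out fraction of 0.25 is used only because it reproduces the
reported 939/312 counts for 1251 subjects; the seed, the integer placeholder
IDs, and the helper name `split_subjects` are assumptions made for
illustration.

```python
import random


def split_subjects(subject_ids, test_fraction=0.25, seed=0):
    """Shuffle subject IDs once and split them into train and test lists."""
    ids = sorted(subject_ids)            # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)     # local RNG, global state untouched
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# 1251 placeholder IDs standing in for the BraTS 2021 training cases
train_ids, test_ids = split_subjects(range(1251))
print(len(train_ids), len(test_ids))
```

Reusing the same seed for every model keeps the comparison on identical train
and test subjects.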

The BraTS Toolkit solutions were scored on the blind validation set of 219
subjects, assuming no data leak from previous years.


3 https://github.com/neuronflow/BraTS-Toolkit.


3 Results


The convergence of the models on T1, T1c, T2, and FLAIR over 40 epochs for the
3-block 3D U-net architecture with data augmentations is shown in Fig. 1. The
experiment report is available on the Weights & Biases page⁴.

The iterations of the 3D U-net ablation study are presented in Table 2.

The BraTS Toolkit predictions for two models were scored on the competition
blind validation set and are shown in Table 1.


Table 1. Results of two models from BraTS Toolkit on the competition blind validation
set in terms of Dice coefficient, Hausdorff 95, and Sensitivity (Specificity equals
0.999 for all table entries).


Model        | Dice                  | Hausdorff 95           | Sensitivity
             | ET    | TC    | WT    | ET     | TC    | WT    | ET    | TC    | WT
zyx 2019     | 0.809 | 0.866 | 0.915 | 14.905 | 6.433 | 4.332 | 0.806 | 0.860 | 0.907
scan lite-20 | 0.830 | 0.868 | 0.922 | 14.502 | 7.913 | 3.949 | 0.808 | 0.863 | 0.914


We show that a model based on T1c images converges better than models based
on the other modalities under the same training conditions. Training on all
modalities simultaneously naturally leads to better quality, as it integrates
information from each sequence.

The first experiments on the selection of model parameters, run on the T1
sequence, showed that the SGD optimizer leads to smoother convergence. Adding
restricted augmentations resolves the fluctuating validation loss, while
extended augmentations significantly lengthen the training process and
negatively affect the quality. Quality increases with a bigger patch size and
a deeper model architecture. The combination of Cross-Entropy (CE) loss with
Dice loss significantly improves the training of the model.

We found no significant difference between the PReLU, Leaky ReLU, and default
ReLU activation functions. However, we observed a reduction in training time
when using trilinear upsampling and U-net preactivations.

The best performance of the reported U-net model was achieved with multi-modal
input, bigger image patches, and a deeper model architecture trained with data
augmentations: the Dice scores are 0.623, 0.791, and 0.779 for ET, TC, and WT,
respectively. This is significantly lower than the performance of last year's
state-of-the-art ensemble solutions: Dice scores of 0.830, 0.868, and 0.922
for ET, TC, and WT.


4 wandb.ai/polina/brats/reports/Brats--Vm11dzo5NDK5NIM?accessToken=zmj73
popylirho9qb51g4fh241g9qxopkmf suz2xccgzen567 1qtqwq9buu8ccOdv.


Table 2. Segmentation results on the validation set (part of the competition training
samples after the train-test split for fine-tuning experiments) in terms of Dice coefficient.


N  | MRI sequences      | Model | Optimizer | Augment | Epochs | ET    | TC    | WT
1  | T1                 | 1     | AdamW     | ✗       | 40     | 0.199 | 0.394 | 0.178
2  | T1                 | 1     | Adam      | ✗       | 40     | 0.242 | 0.418 | 0.178
3  | T1c                | 1     | Adam      | ✗       | 40     | 0.435 | 0.395 | 0.540
4  | T1c                | 1     | AdamW     | ✗       | 40     | 0.471 | 0.386 | 0.537
5  | T1                 | 1     | AdamW     | ✗       | 40     | 0.253 | 0.324 | 0.220
6  | T1                 | 1     | SGD       | ✗       | 40     | 0.288 | 0.358 | 0.245
7  | T1                 | 1     | SGD       | ✓       | 40     | 0.257 | 0.383 | 0.245
8  | T1                 | 1     | SGD       | ✓       | 60     | 0.304 | 0.412 | 0.305
9  | FLAIR              | 1     | SGD       | ✓       | 60     | 0.273 | 0.506 | 0.323
10 | T2                 | 1     | SGD       | ✓       | 60     | 0.355 | 0.539 | 0.372
11 | T1c                | 1     | SGD       | ✓       | 60     | 0.505 | 0.436 | 0.609
12 | T1, T1c, T2, FLAIR | 2     | SGD       | ✓       | 40     | 0.536 | 0.715 | 0.705
13 | T1c                | 2     | SGD       | ✓       | 60     | 0.560 | 0.500 | 0.679
14 | T1c                | 3     | SGD       | ✓       | 60     | 0.577 | 0.497 | 0.684
15 | T1c, FLAIR         | 3     | SGD       | ✓       | 40     | 0.608 | 0.753 | 0.757
16 | T1c                | 4     | SGD       | ✓       | 60     | 0.624 | 0.605 | 0.757
17 | T1c, FLAIR         | 5     | SGD       | ✓       | 60     | 0.616 | 0.778 | 0.763
18 | T1c, FLAIR         | 5     | SGD       | ✓       | 100    | 0.608 | 0.775 | 0.758
19 | T1, T1c, T2, FLAIR | 6     | SGD       | ✓       | 30     | 0.621 | 0.785 | 0.766
20 | T1c, FLAIR         | 6     | SGD       | ✓       | 30     | 0.623 | 0.791 | 0.779

Fig. 1. 3D U-net lightweight architecture training on uni- and multi-modal image input:
(a) train loss; (b) validation loss.


4 Conclusion and Discussion 


In this paper, we provide a baseline segmentation accuracy for each separate
MRI modality and for all four sequences (T1, T1c, T2, and FLAIR) on a
conventional 3D U-net architecture.

We performed an ablation study and developed a training strategy for better
3D U-net training on MR image patches.

We explored the predictive ability of each modality for the enhancing core,
tumor core, and whole tumor, and found that post-contrast T1 has more
predictive ability for all tumor regions. We also compared the baseline
performance with the BraTS 2019-2020 winning solutions zyx 2019 and
scan lite-20. Finally, we share the code and trained weights to facilitate
further research on transfer to different domains and use in other
applications.


Work Limitations. The BraTS Toolkit solutions were scored, as in the main
study, on the competition blind validation set. We assume that this comparison
is fair only if that blind validation set does not include images from
previous BraTS releases.

The chosen architectures are the most convenient ones, but they have been shown
to be outperformed by more complex variations of the U-net or their ensembles.
The key idea of this paper is to highlight the baseline accuracy for each
modality rather than to achieve the best performance. Within the scope of this
work, we did not aim to substantially change the U-net architecture. Still, a
valuable extension would be to train each modality on multiple classifier heads
of the U-net and to try nested structures or redesigned skip connections.


Contribution. Polina Druzhinina - conducted experiments with U-net;
Ekaterina Kondrateva - experiment design and BraTS Toolkit model execution;
Arseny Bozhenko and Vyacheslav Yarkin - Docker creation, dataset manipulation,
and cluster maintenance; Maxim Sharaev and Anvar Kurumkov - reviewed the
camera-ready version of the paper.


Acknowledgements. The reported study was funded by RFBR according to the 
research project 20-37-90149 and by RSCF grant according to the research project 
21-71-10136 (creating and testing DL models on MRI data). 


References 


1. Pominova, M., Artemov, A., Sharaev, M., Kondrateva, E., Bernstein, A., Burnaev,
E.: Voxelwise 3D convolutional and recurrent neural networks for epilepsy and
depression diagnostics from structural and functional MRI data. In: IEEE
International Conference on Data Mining Workshops (ICDMW), pp. 299-307. (2018).
https://doi.org/10.1109/ICDMW.2018.00050

2. McKinney, S.M., Sieniek, M., Godbole, V., et al.: International evaluation of an
AI system for breast cancer screening. Nature 577, 89-94 (2020).
https://doi.org/10.1038/s41586-019-1799-6

3. Ardila, D., Kiraly, A.P., Bharadwaj, S., et al.: End-to-end lung cancer screening
with three-dimensional deep learning on low-dose chest computed tomography.
Nat. Med. 25, 954-961 (2019). https://doi.org/10.1038/s41591-019-0447-x

4. Kondrateva, E., Pominova, M., Popova, E., Sharaev, M., Bernstein, A., Burnaev, 
E.: Domain shift in computer vision models for MRI data analysis: an overview. 
In: Proc. SPIE 11605, Thirteenth International Conference on Machine Vision, 
116050H, 4 January 2021. https://doi.org/10.1117/12.2587872 

5. Cheplygina, V., de Bruijne, M., Pluim, J.P.W.: Not-so-supervised: a survey of semi- 
supervised, multi-instance, and transfer learning in medical image analysis. Med. 
Image Anal. 54, 280-296 (2019). https://doi.org/10.1016/j.media.2019.03.009 


6. Villanueva-Meyer, J.E., Mabray, M.C., Cha, S.: Current clinical brain tumor
imaging. Neurosurgery 81(3), 397-415 (2017). https://doi.org/10.1093/neuros/nyx103

7. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 3431-3440

8. Badrinarayanan, V., Kendall, A., Cipolla, R.: Segnet: a deep convolutional
encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal.
Mach. Intell. 39(12), 2481-2495 (2017)

9. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for
biomedical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M.,
Frangi, A.F. (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234-241. Springer, Cham
(2015). https://doi.org/10.1007/978-3-319-24574-4_28

10. Siddique, N., Paheding, S., Elkin, C.P., Devabhaktuni, V.: U-Net and its variants
for medical image segmentation: a review of theory and applications. IEEE Access
9, 82031-82057 (2021). https://doi.org/10.1109/ACCESS.2021.3086020

11. Ranjbarzadeh, R., Bagherian Kasgari, A., Jafarzadeh Ghoushchi, S., et al.: Brain
tumor segmentation based on deep learning and an attention mechanism using
MRI multi-modalities brain images. Sci. Rep. 11, 10930 (2021).
https://doi.org/10.1038/s41598-021-90428-8

12. Billot, B., et al.: A Learning Strategy for Contrast-agnostic MRI Segmentation.
Medical Imaging with Deep Learning. PMLR (2020)

13. Wang, F., Jiang, R., Zheng, L., Meng, C., Biswal, B.: 3D U-Net based brain tumor
segmentation and survival days prediction. In: Crimi, A., Bakas, S. (eds.) BrainLes
2019. LNCS, vol. 11992, pp. 131-141. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-46640-4_13

14. Qin, X., Zhang, Z., Huang, C., Dehghan, M., Zaiane, O.R., Jagersand, M.: U2-Net:
going deeper with nested U-structure for salient object detection. Pattern Recogn.
106, 107404 (2020)

15. Xue, Y., Xu, T., Zhang, H., Long, L.R., Huang, X.: Segan: adversarial network
with multi-scale L1 loss for medical image segmentation. Neuroinformatics
16(3-4), 383-392 (2018)

16. Chen, J., et al.: Transunet: transformers make strong encoders for medical image
segmentation. arXiv preprint arXiv:2102.04306 (2021)

17. Cheplygina, V., de Bruijne, M., Pluim, J.P.W.: Not-so-supervised: a survey of
semi-supervised, multi-instance, and transfer learning in medical image analysis.
Med. Image Anal. 54, 280-296 (2019)

18. Chen, S., Ma, K., Zheng, Y.: Med3d: Transfer learning for 3d medical image
analysis. arXiv preprint arXiv:1904.00625 (2019)

19. Menze, B.H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J.,
et al.: The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS).
IEEE Trans. Med. Imaging 34(10), 1993-2024 (2015).
https://doi.org/10.1109/TMI.2014.2377694

20. Pérez-García, F., Sparks, R., Ourselin, S.: TorchIO: a Python library for efficient
loading, preprocessing, augmentation and patch-based sampling of medical images
in deep learning. Computer Methods and Programs in Biomedicine, p. 106236.
ISSN: 0169-2607, June 2021. https://doi.org/10.1016/j.cmpb.2021.106236

21. Kofler, F., et al.: BraTS toolkit: translating BraTS brain tumor segmentation
algorithms into clinical and scientific practice. Front. Neurosci. 14, 125 (2020).
https://doi.org/10.3389/fnins.2020.00125


22. Zhao, Y.-X., Zhang, Y.-M., Song, M., Liu, C.-L.: Multi-view semi-supervised 3d
whole brain segmentation with a self-ensemble network. In: Shen, D., Liu, T.,
Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.-T., Khan, A. (eds.) MICCAI
2019. LNCS, vol. 11766, pp. 256-265. Springer, Cham (2019).
https://doi.org/10.1007/978-3-030-32248-9_29

23. McKinley, R., Rebsamen, M., Meier, R., Wiest, R.: Triplanar ensemble of 3D-to-2D
CNNs with label-uncertainty for brain tumor segmentation. In: Crimi, A., Bakas,
S. (eds.) BrainLes 2019. LNCS, vol. 11992, pp. 379-387. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-46640-4_36

24. Isensee, F., Jäger, P.F., Full, P.M., Vollmuth, P., Maier-Hein, K.H.: nnU-Net for
brain tumor segmentation. In: Crimi, A., Bakas, S. (eds.) BrainLes 2020. LNCS,
vol. 12659, pp. 118-132. Springer, Cham (2021).
https://doi.org/10.1007/978-3-030-72087-2_11

25. Davatzikos, C., et al.: Cancer imaging phenomics toolkit: quantitative imaging
analytics for precision diagnostics and predictive modeling of clinical outcome.
J. Med. Imaging 5(1), 011018 (2018). https://doi.org/10.1117/1.JMI.5.1.011018

26. Pati, S., et al.: The cancer imaging phenomics toolkit (CaPTk): technical overview.
In: Crimi, A., Bakas, S. (eds.) BrainLes 2019. LNCS, vol. 11993, pp. 380-394.
Springer, Cham (2020). https://doi.org/10.1007/978-3-030-46643-5_38




Combining Global Information 
with Topological Prior for Brain Tumor 
Segmentation 


Hua Yang¹,², Zhiqiang Shen³, Zhaopei Li¹, Jinqing Liu¹(✉), and Jinchao Xiao²


1 College of Photonic and Electronic Engineering, Fujian Normal University,
Fuzhou, China
jqliu8208@fjnu.edu.cn
2 Guangzhou Institute of Industrial Intelligence, Guangzhou, China
xiaojinchao@sia.cn
3 College of Physics and Information Engineering, Fuzhou University, Fuzhou, China


Abstract. Gliomas are the most common and aggressive malignant primary brain
tumors. Automatic brain tumor segmentation from multi-modality magnetic
resonance images using deep learning methods is critical for glioma diagnosis.
Deep learning segmentation architectures, especially those based on fully
convolutional neural networks, have shown great performance on medical image
segmentation. However, these approaches cannot explicitly model global
information and overlook the topological structure of lesion regions, which
leaves room for improvement. In this paper, we propose a
convolution-and-transformer network (COTRNet) to explicitly capture global
information and a topology-aware loss to constrain the network to learn
topological information. Moreover, we exploit transfer learning by using
parameters pretrained on ImageNet and deep supervision by adding multi-level
predictions to further improve the segmentation performance. COTRNet achieved
Dice scores of 78.08%, 76.18%, and 83.92% for the enhancing tumor, the tumor
core, and the whole tumor, respectively, on the Brain Tumor Segmentation
Challenge 2021. Experimental results demonstrate the effectiveness of the
proposed method.


Keywords: Brain tumor segmentation · Convolutional neural network · Transformer


1 Introduction 


Gliomas are the most common and aggressive malignant primary brain tumors,
with the highest mortality rate and prevalence [16]. Magnetic resonance imaging
(MRI) is one of the most effective tools for glioma diagnosis in clinical
practice. Multi-modal MRI provides complementary information about the
anatomical structure of tumors: T1-weighted (T1) and contrast-enhanced
T1-weighted (T1ce) images highlight the necrotic and non-enhancing tumor core,
while T2-weighted (T2) and fluid-attenuated inversion recovery (FLAIR) images
enhance the peritumoral edema [17].

Accurate segmentation of brain tumors from MRI plays an important role in
glioma treatment and operative planning [6]. However, manual segmentation of
brain tumors is time-consuming and resource-intensive. The segmentation
results rely on the experience of doctors and are affected by inter- and
intra-observer errors [19]. Therefore, automatic segmentation is required.
Recently, deep learning-based methods, especially fully convolutional neural
networks (FCN), have demonstrated dominant performance in both natural [2,15]
and medical image segmentation [9,20,25]. Nevertheless, automatic brain tumor
segmentation remains a challenge due to the extreme intrinsic heterogeneity in
appearance, shape, and histology [17]. Examples of gliomas are shown in Fig. 1.


Fig. 1. Examples of gliomas with various locations, appearances, shapes, and histology
in MRI. Necrotic tumor cores, peritumoral edema, and GD-enhancing tumor are
highlighted in red, green, and yellow, respectively. (Color figure online)




Many studies have addressed the challenge of brain tumor segmentation
[1,10,14,19,22]. Pereira et al. first investigated the potential of using a
CNN with small convolutional kernels for brain tumor segmentation [19].
Havaei et al. exploited a two-pathway CNN to extract both local and more
global contextual features simultaneously, and combined them to accurately
segment gliomas [10]. More recently, Liu et al. proposed a multi-modal tumor
segmentation network with a fusion block based on spatial and channel
attention to aggregate multi-modal features for glioma delineation [14].
Ahmad et al. designed a context-aware 3D U-Net that uses densely connected
blocks in both the encoder and decoder paths to extract multi-contextual
information based on the concept of feature reusability [1]. Wacker et al.
employed a pretrained model to construct the U-Net encoder in order to
stabilize the training process and improve prediction performance [22]. Even
though the above methods achieved favorable performance on glioma
segmentation, they cannot explicitly model global information. Long-range
dependency, i.e., a large receptive field, is crucial for a model to perform
accurate segmentation [23]. These approaches aggregate global information
implicitly by stacking several local operations, i.e., convolutional layers
interlaced with down-sampling operators; stacking a large number of
convolutional layers may reduce a model's efficiency and cause vanishing
gradients by impeding back-propagation. Moreover, the topological information,
which can serve as prior knowledge to simplify the segmentation task, is not
considered.

In this paper, we propose a convolution-and-transformer network (COTRNet)
combined with a topology-aware (TA) loss, which not only explicitly models
global information but also leverages a topological prior to regularize the
network training process. In addition, we exploit transfer learning by using a
pretrained ResNet [11] to initialize the encoder of COTRNet. Furthermore, we
add a deep supervision mechanism [13] to the decoder of COTRNet for prediction
refinement. Specifically, COTRNet builds on a U-Net-like architecture, where
the encoder derives from ResNet and the decoder is the same as that of U-Net
except for the additional deep supervision outputs. The TA loss is a weighted
combination of cross-entropy loss and Dice loss. To exploit the topological
prior, we modify the one-hot coding by transforming each lesion region into a
single connectivity domain (SCD). The difference between the single
connectivity domain coding and one-hot coding is illustrated in Fig. 4. We
evaluate the proposed method on the Brain Tumor Segmentation (BraTS) Challenge
2021 [3-6,17]. Experimental results demonstrate the effectiveness of the
proposed method.


2 Method 


In general, the proposed method combines the COTRNet network architecture with
the TA loss. We detail the network architecture of COTRNet in Sect. 2.1 and
the TA loss in Sect. 2.2. Finally, we specify the implementation details in
Sect. 3.

Fig. 2. Network architecture of COTRNet.


2.1 Network Architecture 


Global information, i.e., long-range dependency, is critical for medical image
segmentation. Previous methods capture long-range dependencies gradually by
stacking local operators. Inspired by the detection transformer (DETR) [7],
which uses a transformer to explicitly model the global information of the
input features, we propose COTRNet to model global information for brain tumor
segmentation. The network architecture of COTRNet is illustrated in Fig. 2.
COTRNet takes as input slices of size 4 × 224 × 224, where the four channels
correspond to the four modalities, and outputs a probability map of size
1 × 224 × 224.

Overall, COTRNet is a U-Net-like architecture consisting of an encoder for
feature extraction, a decoder for segmentation prediction, and several skip
paths for feature reuse. Specifically, the encoder is composed of an input
convolutional block and four residual blocks interleaved with transformer
encoder layers [21] and max-pooling layers. The convolutional layers are
initialized with the parameters of ResNet18 pretrained on ImageNet. Four
transformer encoder layers are inserted into the encoder to explicitly model
global information. The diagram of the transformer encoder is shown in Fig. 3.
A convolutional feature map is flattened into a sequence. Then, the sequence
is fed into a transformer encoder to model global information. Finally, we
reshape the sequence into a matrix with the same shape as the input feature
map. COTRNet includes four skip paths, through which the feature maps from the
encoder are transferred to the decoder and concatenated with those of the
decoder. The decoder is the same as that of the vanilla U-Net except for four
additional output layers added for deep supervision. Each convolutional layer
is followed by a batch normalization layer and a ReLU activation, except for
the output layers. Each output layer is a convolutional layer with a kernel
size of one that maps the channels of the feature maps to the number of target
classes.
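
The flatten and reshape steps around each transformer encoder layer can be
illustrated with a framework-free sketch on nested lists. It assumes a
row-major token order (one token per spatial position, one feature per
channel), which the text does not specify; the transformer layer itself is
omitted, and both helper names are hypothetical.

```python
def flatten_to_sequence(feature_map):
    """Turn a C x H x W feature map (nested lists) into H*W tokens of length C."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def reshape_to_map(sequence, height, width):
    """Inverse of flatten_to_sequence: rebuild the C x H x W layout."""
    channels = len(sequence[0])
    return [
        [[sequence[y * width + x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]


# toy 2 x 2 x 3 feature map; a transformer encoder layer would act on `tokens`
fmap = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]
tokens = flatten_to_sequence(fmap)
restored = reshape_to_map(tokens, height=2, width=3)
```

Because the reshape is the exact inverse of the flattening, the following
convolutional layers see a feature map with the same spatial layout as before.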
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Fig. 3. Diagram of transformer encoder. 


2.2 Loss Function 


To leverage the topological prior of the segmentation objects, we replace
one-hot coding with SCD coding and combine this coding mechanism with an
improved weighted cross-entropy and Dice loss, which we refer to as the
topology-aware loss. The difference between one-hot coding and SCD coding is
illustrated in Fig. 4. One-hot coding translates the region of a target label
to the corresponding channel, while SCD coding considers the inclusion
relation between labels and translates the region of a target label to a
single connectivity domain. This coding mechanism suits topological structures
in which the regions contain one another layer by layer.
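
The contrast between the two codings can be sketched on a single row of
labels. The nesting used for SCD below (the ED channel contains ET, which
contains NCR) is an assumption made for illustration, since the exact
inclusion order is given only in Fig. 4; the background channel is kept as in
one-hot coding, and `encode` is a hypothetical helper.

```python
# Label values from the BraTS annotations: 0 background, 1 NCR, 2 ED, 4 ET.
ONE_HOT = {"BG": {0}, "NCR": {1}, "ED": {2}, "ET": {4}}
# Assumed nesting for SCD coding (ED contains ET contains NCR); illustrative only.
SCD = {"BG": {0}, "NCR": {1}, "ET": {1, 4}, "ED": {1, 2, 4}}


def encode(label_row, coding):
    """Return one binary channel per class for a row of integer labels."""
    return {
        name: [1 if value in members else 0 for value in label_row]
        for name, members in coding.items()
    }


row = [0, 2, 4, 1, 4, 2, 0]       # a scan line crossing the lesion
one_hot = encode(row, ONE_HOT)
scd = encode(row, SCD)
```

In this toy row the one-hot ED channel consists of two separate pixels with a
hole between them, whereas the SCD channel for ED is one contiguous run.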
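Formally, the TA loss is formulated as

$$\mathcal{L}_{TA}(\hat{Y}, Y) = \lambda\,\mathcal{L}_{WCE}(\hat{Y}, Y) + (1-\lambda)\,\mathcal{L}_{DCE}(\hat{Y}, Y) \tag{1}$$

where $\lambda$ controls the contribution of $\mathcal{L}_{WCE}$ and
$\mathcal{L}_{DCE}$ to the total loss $\mathcal{L}_{TA}$. Empirically,
$\lambda = 0.5$ in our experiments. $Y$ is the ground truth and $\hat{Y}$ the
predicted mask.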

Further, $\mathcal{L}_{WCE}$ is defined as

$$\mathcal{L}_{WCE}(\hat{Y}, Y) = -\sum_{c=1}^{C}\sum_{j=1}^{M} w_c\left[y_{jc}\log\hat{y}_{jc} + (1-y_{jc})\log(1-\hat{y}_{jc})\right] \tag{2}$$

The $\mathcal{L}_{DCE}$ is denoted as

Fig. 4. Illustration of one-hot coding and the single connectivity domain coding.


where $M$ refers to the total number of pixels of the input slice and $C$
refers to the total number of classes, which is equal to four (NCR, ED, ET,
and the background) in our task. $w_c$ denotes the weighting coefficient of
the $c$-th class; the coefficients are set as $w_1 = 1$, $w_2 = 5$, $w_3 = 4$,
$w_4 = 5$ in our experiments. $y_{jc}$ is the $j$-th ground-truth pixel of
class $c$, and $\hat{y}_{jc}$ is the corresponding predicted probability.
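
The ingredients named above can be put together in a small plain-Python
sketch. Several parts are assumptions rather than the authors' exact
formulation: the cross-entropy term is written as a class-weighted binary
cross-entropy over channels, the Dice term is a standard weighted soft Dice
loss (its exact form is not recoverable from the text), and the two are
blended as `lam * WCE + (1 - lam) * Dice`. The class order matching the
weights (1, 5, 4, 5) is also assumed.

```python
import math


def weighted_bce(pred, target, weights, eps=1e-7):
    """Class-weighted binary cross-entropy summed over classes and pixels."""
    total = 0.0
    for p_c, y_c, w_c in zip(pred, target, weights):
        for p, y in zip(p_c, y_c):
            p = min(max(p, eps), 1.0 - eps)          # keep log() finite
            total -= w_c * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


def soft_dice_loss(pred, target, weights, eps=1e-7):
    """Assumed Dice term: weighted mean of (1 - soft Dice) per class."""
    loss = 0.0
    for p_c, y_c, w_c in zip(pred, target, weights):
        inter = sum(p * y for p, y in zip(p_c, y_c))
        denom = sum(p_c) + sum(y_c)
        loss += w_c * (1.0 - (2.0 * inter + eps) / (denom + eps))
    return loss / sum(weights)


def ta_loss(pred, target, weights=(1, 5, 4, 5), lam=0.5):
    """Assumed combination: lam * WCE + (1 - lam) * Dice."""
    return lam * weighted_bce(pred, target, weights) + (1 - lam) * soft_dice_loss(
        pred, target, weights
    )


# four classes, three pixels each (toy values)
target = [[1, 0, 0], [0, 1, 1], [0, 0, 1], [0, 0, 0]]
perfect = [[float(v) for v in row] for row in target]
noisy = [[0.6, 0.3, 0.2], [0.2, 0.7, 0.6], [0.1, 0.2, 0.5], [0.1, 0.1, 0.1]]
```

Calling `ta_loss(perfect, target)` gives a value close to zero, while
`ta_loss(noisy, target)` is clearly larger, which is the behaviour expected of
a segmentation loss.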

Deep supervision is adopted by taking into account the five output levels of
the decoder, with output sizes of 14 × 14, 28 × 28, 56 × 56, 112 × 112, and
224 × 224, in back-propagation. Therefore, for a batch containing $N$ images,
the loss function $J$ becomes

$$J = \frac{1}{N}\sum_{i=1}^{N}\sum_{d=1}^{5}\alpha_d\,\mathcal{L}_{TA}(\hat{Y}_{i,d}, Y_{i,d}) \tag{4}$$

where $Y_{i,d}$ is the $d$-th level of the $i$-th ground truth in a batch of
input images, and $\hat{Y}_{i,d}$ is the corresponding prediction. We set
$\alpha_1 = 0.05$, $\alpha_2 = 0.05$, $\alpha_3 = 0.2$, $\alpha_4 = 0.3$, and
$\alpha_5 = 0.4$ in the experiments.
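
The level weighting can be sketched as a small aggregation function that takes
already computed per-level loss values. Averaging over the batch and the
mapping of the five weights to the output sizes in the listed order (coarsest
to finest) are assumptions; `deep_supervision_loss` is a hypothetical helper.

```python
ALPHAS = (0.05, 0.05, 0.2, 0.3, 0.4)   # level weights, coarsest to finest


def deep_supervision_loss(level_losses_per_image, alphas=ALPHAS):
    """Average over the batch of the alpha-weighted sum of per-level losses.

    level_losses_per_image[i][d] is the loss of image i at decoder level d.
    """
    n_images = len(level_losses_per_image)
    total = 0.0
    for level_losses in level_losses_per_image:
        if len(level_losses) != len(alphas):
            raise ValueError("expected one loss value per decoder level")
        total += sum(a * loss for a, loss in zip(alphas, level_losses))
    return total / n_images


# two images, five decoder levels each (toy per-level loss values)
batch = [[1.0, 1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0, 2.0]]
j = deep_supervision_loss(batch)
```

Since the five weights sum to one, a level-independent loss value passes
through unchanged, and the full-resolution output carries the largest share.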


3 Implementation Details 


Pre-processing. We normalize the intensity of each MRI to [0, 1]. In training,
slices that contain foreground labels are extracted and resampled to
4 × 224 × 224 as network input. Data augmentation, including random flips,
random rotations, and random crops, is applied during training. At test time,
all slices of an MRI case are fed into the model in order to obtain the
predicted masks, and the overall prediction for a case is obtained by
combining the predicted masks of all slices.


Post-processing. In the testing phase, predicted masks are resampled to the
original size of the input MRI. Since a glioma is a single entity in an MRI,
we apply the maximum connected domain operation to the whole predicted mask.
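
One way to realize such a maximum connected domain operation is a
breadth-first flood fill that keeps only the largest component of the
foreground. The sketch below works on a set of voxel coordinates and assumes
6-connectivity in 3D, which the text does not specify; `largest_component` is
a hypothetical helper.

```python
from collections import deque


def largest_component(voxels):
    """Keep the largest 6-connected component of a set of (z, y, x) voxels."""
    remaining = set(voxels)
    best = set()
    neighbours = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    while remaining:
        seed = remaining.pop()
        component, queue = {seed}, deque([seed])
        while queue:                          # breadth-first flood fill
            z, y, x = queue.popleft()
            for dz, dy, dx in neighbours:
                nxt = (z + dz, y + dy, x + dx)
                if nxt in remaining:
                    remaining.remove(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return best


# a 4-voxel blob plus one isolated false-positive voxel
mask = {(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 1, 1), (5, 5, 5)}
cleaned = largest_component(mask)
```

In the toy mask, the isolated voxel is discarded and the four connected voxels
are kept.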
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4 Experiments 


4.1 Dataset 


We evaluated the proposed method on the BraTS 2021 dataset with 2000 cases,
which comprises a training set with 1251 MRI cases and the corresponding
annotations, a validation set with 219 MRI cases, and a test set not available
to participants. All MRI cases are multimodal, consisting of T1-weighted (T1),
contrast-enhanced T1-weighted (T1ce), T2-weighted (T2), and fluid-attenuated
inversion recovery (FLAIR) images. Annotations comprise the GD-enhancing tumor
(ET), the peritumoral edematous/invaded tissue (ED), and the necrotic tumor
core (NCR) [17]. The provided segmentation labels have values of 1 for NCR,
2 for ED, 4 for ET, and 0 for background. We first evaluated the proposed
method on the training set through five-fold cross-validation and obtained
preliminary results on the unseen data of the validation set. The final
results of the BraTS 2021 challenge are obtained on the unseen test data.


4.2 Metrics 


Following the BraTS 2021 challenge, we adopted the Dice similarity coefficient
(DSC) and the Hausdorff distance (HD) to quantitatively evaluate the
segmentation performance. DSC measures the similarity between two sets and is
defined as follows:

$$\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|} \tag{5}$$
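
As a worked example of the Dice similarity coefficient, the following sketch
computes it for two small sets of foreground pixel coordinates. Returning 1.0
when both masks are empty is a convention assumed here, not something stated
in the text.

```python
def dice_coefficient(a, b):
    """DSC between two sets of foreground voxel coordinates."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0                      # convention assumed for two empty masks
    return 2.0 * len(a & b) / (len(a) + len(b))


prediction = {(0, 0), (0, 1), (1, 1)}
reference = {(0, 1), (1, 1), (2, 1)}
score = dice_coefficient(prediction, reference)
```

Here two of the three pixels in each mask overlap, so the score is
2 · 2 / (3 + 3) = 2/3.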


HD measures how far two subsets of a metric space are from each other. It is
defined as the greatest distance from a point in one set to the closest point
in the other set:

$$\mathrm{HD}(A, B) = \max\left\{\sup_{a \in A}\inf_{b \in B} d(a, b),\ \sup_{b \in B}\inf_{a \in A} d(b, a)\right\} \tag{6}$$
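
For finite point sets, the Hausdorff distance can be evaluated by brute force,
as in the sketch below. It implements the plain maximum form with Euclidean
distance and quadratic cost, which is adequate for illustration but not for
full 3D volumes; the helper names are hypothetical.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


set_a = [(0, 0), (1, 0)]
set_b = [(0, 0), (4, 0)]
hd = hausdorff_distance(set_a, set_b)
```

In the example, the directed distance from `set_a` to `set_b` is 1, while the
point (4, 0) lies 3 away from its nearest neighbour in `set_a`, so the
symmetric distance is 3.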


4.3 Experimental Setting 


We conducted the experiments in PyTorch [18], accelerated by an NVIDIA
GeForce GTX 1080 with 8 GB of GPU memory. We used the Adam optimizer [12] with
a learning rate of 1e-4. The network was trained for 20 epochs with a batch
size of 2.

Table 1. Quantitative results on the BraTS 2021 training set through five-fold
cross-validation. COTRNet w/o PT: COTRNet without using pretrained parameters.
COTRNet w/o DS: COTRNet without using deep supervision.


Method         | Necrotic tumor core | Peritumoral edema | Enhancing tumor
U-Net          | 0.5157              | 0.6822            | 0.6721
COTRNet w/o PT | 0.5698              | 0.7296            | 0.6852
COTRNet w/o DS | 0.5670              | 0.7297            | 0.7010
COTRNet        | 0.5874              | 0.7309            | 0.7273

5 Results 


In the following, we report the results on the BraTS 2021 dataset. We
conducted an ablation study on the BraTS 2021 training set through five-fold
cross-validation, which is presented in Sect. 5.1. Further, preliminary
results were obtained on the validation set and are reported in Sect. 5.2. The
final results, obtained by evaluating the proposed model on the test set, are
reported in Sect. 5.3.


5.1 Results on BraTS 2021 Training Set 


We conducted the ablation study on the training set through five-fold
cross-validation. The training set was split in order into five subsets
according to image IDs. Note that these results were obtained with our own
data split, so they are not necessarily comparable with other challenge
submissions. Moreover, we used the DSC to evaluate the model performance and
calculated the DSCs on NCR, ED, and ET, respectively. The quantitative results
are presented in Table 1. COTRNet achieved DSC of 58.74%, 73.09%, and 72.73%
in the necrotic tumor core (NCR), the peritumoral edema (ED), and the
GD-enhancing tumor (ET) segmentation, which is the best performance among the
four compared methods. We randomly selected multiple cases to illustrate the
segmentation results, as shown in Fig. 5. The qualitative results are
consistent with the quantitative ones.


5.2 Results on BraTS 2021 Validation Set 


For evaluation on the validation set, we trained our model on the whole
training set and submitted the segmentation results to the challenge website
to obtain the segmentation performance. Unlike the evaluation on the training
set, in which we considered the NCR, ED, and ET sub-regions according to the
annotations, the validation phase considers a different grouping of the glioma
sub-regions: ET, TC, and WT. The results are listed in Table 2. COTRNet
achieved DSC of 77.60%, 80.21%, and 89.34% and HD of 24.9893, 19.6241, and
7.0938 in ET, TC, and WT, respectively.
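
The relation between the annotation labels and the three evaluated
sub-regions follows the usual BraTS convention: the tumor core (TC) is the
union of NCR and ET, and the whole tumor (WT) additionally includes ED. A
minimal sketch on a flat list of label values (the helper name `region_masks`
is hypothetical):

```python
# BraTS label values: 1 = NCR, 2 = ED, 4 = ET, 0 = background.
REGIONS = {
    "ET": {4},          # enhancing tumor
    "TC": {1, 4},       # tumor core = NCR + ET
    "WT": {1, 2, 4},    # whole tumor = NCR + ED + ET
}


def region_masks(labels):
    """Map a flat list of label values to binary ET/TC/WT masks."""
    return {
        name: [1 if value in members else 0 for value in labels]
        for name, members in REGIONS.items()
    }


masks = region_masks([0, 1, 2, 4, 2, 0])
```

The three masks are nested (ET inside TC inside WT), which is why the
validation metrics are not directly comparable with the per-label scores
reported on the training set.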


5.3 Results on BraTS 2021 Test Set 


For evaluation on the BraTS 2021 test set, we trained our model on the whole
training set and submitted the Docker container of our trained model to the
challenge website for testing. The results on the BraTS 2021 test set are
shown in Table 3. COTRNet achieved DSC of 78.08%, 76.18%, and 83.92% and HD of
28.2266, 34.4783, and 15.6148 in ET, TC, and WT, respectively.


Table 2. Quantitative results on the BraTS 2021 validation set.


Metrics      | ET (DSC) | TC (DSC) | WT (DSC) | ET (HD) | TC (HD) | WT (HD)
Mean         | 0.7760   | 0.8021   | 0.8934   | 24.9893 | 19.6241 | 7.0938
Std          | 0.2675   | 0.2617   | 0.1171   | 84.7575 | 66.0254 | 14.6603
Median       | 0.8737   | 0.9174   | 0.9262   | 1.7321  | 3       | 3.1623
25% quantile | 0.7779   | 0.7836   | 0.8825   | 1       | 1.7320  | 2.2361
75% quantile | 0.9230   | 0.9516   | 0.9465   | 3.6736  | 9       | 6.1644


Table 3. Quantitative results on the BraTS 2021 test set.


Metrics      | ET (DSC) | TC (DSC) | WT (DSC) | ET (HD) | TC (HD) | WT (HD)
Mean         | 0.7808   | 0.7618   | 0.8392   | 28.2263 | 34.4783 | 15.6148
Std          | 0.2725   | 0.3256   | 0.2272   | 87.4218 | 84.5480 | 31.4253
Median       | 0.8813   | 0.9191   | 0.9221   | 1.4142  | 3       | 3.6056
25% quantile | 0.7831   | 0.7618   | 0.8511   | 1       | 1.4142  | 2
75% quantile | 0.9349   | 0.9609   | 0.9510   | 3.2549  | 15.56   | 9.7723


Fig. 5. Qualitative results on the BraTS 2021 training set.


6 Discussion and Conclusion 


In this paper, we proposed COTRNet to solve the brain tumor segmentation
problem. COTRNet leverages transformer encoder layers to explicitly capture
global information and exploits the topological prior of brain tumors by
introducing topology constraints into the network training process. Moreover,
transfer learning and a deep supervision mechanism were also used to improve
the segmentation performance. Experimental results on the BraTS 2021 challenge
demonstrated the effectiveness of the proposed method.

Table 1 summarizes the results on the BraTS training set through five-fold
cross-validation. Analysing these results, we can draw the following three
conclusions.


Effectiveness of Transformer Encoder Layers. COTRNet and the two ablated
methods, i.e., COTRNet w/o PT and COTRNet w/o DS, outperformed U-Net by a
large margin, which demonstrates the effectiveness of transformer encoder
layers in capturing global information.


Effectiveness of Transfer Learning. COTRNet exceeded COTRNet w/o PT by a DSC
of 1.76%, 0.13%, and 4.21% in NCR, ED, and ET, respectively. This improvement
comes from using pretrained parameters, which help the model converge to a
better optimum.




Effectiveness of Deep Supervision. The performance of COTRNet and
COTRNet w/o DS is very close. This is because deep supervision is exploited to
gradually refine the segmentation details, as shown in Fig. 5. Although these
details are crucial for tumor delineation, they contribute relatively little
to the DSC compared with the bulk of the tumor.

Our method inserted transformer encoder layers into the encoder-decoder
architecture to explicitly capture the global information of input images for
image segmentation. Since the transformer is effective at modeling long-range
dependencies, a more direct approach would be to use the transformer as the
feature encoder without convolution operations. However, transformers require a
large amount of GPU memory, which is difficult to provide when the transformer
layer directly takes images as input. Therefore, we first adopted several CNN
layers for feature dimension reduction, as done in DETR [7]. On the other hand,
the transformer takes sequence data as input, which can break up the image
structure. Therefore, the subsequent CNN layers are employed to recover the
image structure. Recently, transformers have been widely exploited in medical
image processing [8,24]. Zhang et al. presented a two-branch architecture that
combines transformers and CNNs in a parallel style for polyp segmentation [24].
Chen et al. proposed TransUNet, in which the transformer encodes tokenized image
patches from a convolutional neural network (CNN) feature map as the input
sequence for extracting global contexts [8]. However, these methods need
large-scale GPU memory, which is not feasible for common users. Hence, we
proposed a lightweight transformer-based segmentation framework that needs only
8G of GPU memory for network training. In future work, we will focus on
simplifying the transformer network and developing more efficient segmentation
architectures.
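
To make the memory argument concrete, the following sketch estimates the size of a single dense self-attention score matrix. The 240 x 240 input size, the per-pixel tokenization, and the 16x reduction factor are illustrative assumptions, not values reported for COTRNet; the function name is hypothetical.

```python
def attention_matrix_bytes(height, width, patch=1, bytes_per_value=4):
    """Memory of one dense self-attention score matrix for a 2D input.

    The input is split into (height // patch) * (width // patch) tokens and
    every token attends to every other token, so the matrix has tokens**2
    entries.
    """
    tokens = (height // patch) * (width // patch)
    return tokens * tokens * bytes_per_value


# Hypothetical sizes: a 240 x 240 slice fed pixel by pixel, versus the same
# slice after a CNN stem that reduces each side by a factor of 16.
full = attention_matrix_bytes(240, 240, patch=1)
reduced = attention_matrix_bytes(240, 240, patch=16)
print(f"per-pixel tokens:    {full / 2**30:.1f} GiB")
print(f"after 16x reduction: {reduced / 2**10:.1f} KiB")
```

The quadratic growth in the number of tokens is what makes a CNN stem for dimension reduction attractive before the transformer layers.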

Abstract. Gliomas are the most common primary malignant tumors of
the brain. Magnetic resonance (MR) imaging is one of the main methods for
detecting brain tumors, so accurate segmentation of brain tumors from MR
images has important clinical significance throughout the diagnostic
process. At present, most popular automatic medical image segmentation
methods are based on deep learning. Many researchers have developed
convolutional neural networks, applied them to brain tumor segmentation,
and demonstrated superior performance. In this paper, we propose a novel
deep learning-based method named multi-scale feature recalibration
network (MSFR-Net), which can extract features at multiple scales and
recalibrate them through the multi-scale feature extraction and
recalibration (MSFER) module. In addition, we improve the segmentation
performance by exploiting cross-entropy and dice loss to address the
class imbalance problem. We evaluate our proposed architecture on the
Brain Tumor Segmentation (BraTS) challenge 2021 test dataset. The
proposed method achieved dice coefficients of 89.15%, 83.02%, and 82.08%
for the whole tumor, tumor core, and enhancing tumor, respectively.


Keywords: Brain tumor segmentation · Convolutional neural
network · Multi-scale feature


1 Introduction 


Gliomas are the most common primary malignant brain tumors, which are
caused by cancerous changes in glial cells in the brain and spinal cord. Glioma
is a very aggressive and deadly disease, and its incidence rate is increasing
in highly developed industrialized countries [13]. Accurate tumor delineation
could significantly improve the quality of care. Magnetic resonance (MR) imaging
is an effective technology for brain tumor diagnosis. However, accurate
diagnosis of brain tumors relies on the experience of doctors, is
time-consuming, and often suffers from human error. Furthermore, due to the
large amount of data, manual segmentation is very difficult. Therefore, accurate
and automated segmentation of brain tumors from MR images is critical for the
potential diagnosis and treatment of this disease. To this end, the Brain Tumor


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 
A. Crimi and S. Bakas (Eds.): BrainLes 2021, LNCS 12962, pp. 216-226, 2022. 
https://doi.org/10.1007/978-3-031-08999-2_17




Segmentation Challenge (BraTS) provides a platform for participants to evaluate
their models and compare their results with those of other teams using the
BraTS 2021 dataset [1-4, 12]. BraTS 2021 has two tasks: brain tumor segmentation
and prediction of the MGMT promoter methylation status in mpMRI scans. In this
work, we focus only on the segmentation task.

Fully Convolutional Neural Networks (FCN) have greatly promoted the
development of medical image segmentation. In particular, U-Net [16] and its
variants [14,20,21] have achieved great success in the domain of medical image
segmentation. Likewise, in the BraTS challenge, network variants based on the
U-Net framework have achieved excellent brain tumor segmentation results. For
example, recent submissions for MRI brain tumor segmentation [7,9-11] are based
on different variants of this structure. In BraTS 2019, Jiang et al. [7]
proposed an end-to-end two-stage cascaded U-Net and won first place. They
divide the segmentation task into two stages. In the first stage, a variant of
U-Net is used to obtain an initial segmentation result, and the result is
concatenated with the original input image as the input of the second stage. In
the second stage, a structure with two decoders is used to perform the
segmentation task in parallel to improve the performance, and two different
segmentation maps are output. C. Liu et al. [9] proposed a novel multi-modal
tumor segmentation network and designed a spatial constraint loss, which can
effectively fuse complementary tumor information from multi-modal MR images.
H. McHugh et al. [11] presented a fully automated segmentation model based on a
2D U-Net architecture with dense blocks. S. Ma et al. [10] proposed a new
network based on U-Net, which uses a residual U-shaped network as the main
structure and obtains good segmentation results. Although these methods
achieved favorable performance in brain tumor segmentation, they did not
consider multi-scale information or feature recalibration, which leaves room
for further improvement.

In this paper, we propose a fully automatic brain tumor segmentation method
named MSFR-Net. MSFR-Net consists of an improved encoder-decoder architecture
in which multi-scale feature extraction and a self-attention mechanism are
used. Specifically, we designed a multi-scale feature extraction and
recalibration (MSFER) module, which can effectively utilize the features of
multi-modal MR images learned by CNNs. In addition, considering the class
imbalance problem of brain tumors, we integrate the cross-entropy and dice
losses by adding weighting coefficients to the loss terms. We evaluated the
proposed method on BraTS 2021. Experimental results show that our method
achieved dice coefficients (DSC) of 90.18%, 81.61%, and 76.89% on the whole
tumor (WT), tumor core (TC), and enhancing tumor (ET) of the validation set,
respectively. The main contributions of our method are summarized as follows:


1) We design an MSFER module by cascading CNN layers to extract multi-scale
features and by adopting channel-spatial attention to recalibrate features.

2) We insert the MSFER module into an encoder-decoder architecture to develop
MSFR-Net for accurate brain tumor segmentation.




3) To address the class imbalance problem, we propose an improved weighted
cross-entropy and dice loss, in which the class distributions are considered.


2 Method 


In the following, we first describe the overall network architecture in
Sect. 2.1. Then the multi-scale feature extraction and recalibration (MSFER)
module is specified in Sect. 2.2. Finally, we elaborate on the weighted
cross-entropy and dice loss in Sect. 2.3.

Fig. 1. The architecture of the multi-scale feature recalibration network
(MSFR-Net). It is composed of an encoding structure (left side) and a decoding
structure (right side), and the input is the concatenation of multi-modal MRI
2D slices.


2.1 Network Architecture 


Overall, we utilize a U-Net-like [16] encoder-decoder architecture as our
baseline model. The encoder has four basic blocks interleaved with four
down-sampling layers. The decoder includes four up-sampling layers interleaved
with four basic blocks. The encoder and decoder are connected by four skip
connection paths for feature concatenation. Each basic block of U-Net, which
contains two CNN layers, is replaced with the MSFER module to construct the
proposed MSFR-Net. The diagram of MSFR-Net is illustrated in Fig. 1.

The first step is the encoding stage for feature extraction. We concatenate
the multi-modal MR images (T1, T1ce, T2, Flair) into an input of size
4 x 240 x 240. The MSFER module includes a multi-scale feature extraction
(MSFE) block and a feature recalibration (FR) block. We detail this module
in Sect. 2.2. The input is first passed into an MSFER module, which extracts
feature information at different scales and recalibrates it, and is then
down-sampled to gradually aggregate semantic information at the cost of spatial
information. Down-sampling is realized by 2 x 2 max pooling. The second step is
the decoding stage for spatial information recovery and pixel-level
classification, in which the network architecture of each layer is consistent
with the encoding stage. The feature maps from the encoder stage are
concatenated with those from the decoder through skip connections. The final
output layer of the network is a convolution layer with a kernel size of one
followed by a SoftMax activation for segmentation prediction.
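
As a quick check of the encoder geometry described above, the sketch below lists the spatial size of the feature maps after each 2 x 2 max pooling step for a 240 x 240 input. The helper name is hypothetical, and the sketch assumes that the convolutions inside each block preserve the spatial size, which the text does not state explicitly.

```python
def encoder_spatial_sizes(size=240, num_downsamples=4, pool=2):
    """Spatial side length before and after each pooling layer.

    Assumes the convolutions inside each block keep the spatial size, so only
    the pooling layers change it.
    """
    sizes = [size]
    for _ in range(num_downsamples):
        size //= pool  # a 2 x 2 max pooling halves each side
        sizes.append(size)
    return sizes


print(encoder_spatial_sizes())
```

Under these assumptions the decoder would mirror the same sizes in reverse order, which is what allows the skip connections to concatenate feature maps of equal resolution.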


2.2 Multi-scale Feature Extraction and Recalibration Module 


Fig. 2. The multi-scale feature extraction and recalibration (MSFER) module.
Legend: concatenation; convolution + BN + ReLU; element-wise addition. Note
that the subfigure of CBAM refers to [19].


In most existing U-Net-based methods, the features from the encoder are
directly connected with those from the decoder. Low-level features contain more
detailed information, whereas high-level features carry more semantic
information. These methods do not take into account the complementary
information of features at different scales, which can lead to performance
degradation and even classification errors. In our work, we design the MSFER
module, which fuses and recalibrates the features of different scales.

Figure 2 illustrates the MSFER module. An MSFER module is composed of an
MSFE block with three 3 x 3 convolution layers and an FR block with the
convolutional block attention module (CBAM) [19]. The output feature maps of
the three convolution layers of the MSFE block are concatenated and then passed
through the FR block. The feature maps of the different convolutions are
concatenated because they have different receptive fields, i.e. they represent
features at different scales. The concatenated features are used as the input
of the FR block. Moreover, the output features of the FR block are added to the
input features, which can improve the training efficiency [6]. Finally, we map
the output channels to the required size using a 1 x 1 convolution, and the
result serves as the input of the subsequent network layer.
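
The statement that the three convolutions see different scales can be checked with the standard receptive-field recurrence. The sketch below assumes that the three 3 x 3 convolutions are cascaded with stride 1 and no dilation; these settings and the function name are assumptions for illustration.

```python
def cascaded_receptive_fields(num_layers=3, kernel=3, stride=1):
    """Receptive field (in input pixels) after each layer of a conv cascade."""
    fields = []
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for _ in range(num_layers):
        rf += (kernel - 1) * jump
        jump *= stride
        fields.append(rf)
    return fields


print(cascaded_receptive_fields())
```

Under these assumptions the three outputs cover 3 x 3, 5 x 5 and 7 x 7 input neighbourhoods, so concatenating them gives the FR block features at three scales.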

2.3 Loss Function


We propose a weighted cross-entropy (WCE) and dice (WDSC) loss for brain
tumor segmentation, which can mitigate the problem of class imbalance.
Specifically, we utilize the WCE loss to reduce the imbalance at the pixel level
and the WDSC loss to alleviate the problem at the region level.

The WCE loss is represented as 


$$ L_{WCE}(P, \hat{P}) = -\frac{1}{M} \sum_{s=1}^{S} w_s \sum_{j=1}^{M} \left[ p_{js} \log \hat{p}_{js} + (1 - p_{js}) \log (1 - \hat{p}_{js}) \right] \tag{1} $$


The WDSC loss is denoted as 


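$$ L_{WDSC}(P, \hat{P}) = \sum_{s=1}^{S} w_s \left( 1 - \frac{2 \sum_{j=1}^{M} p_{js} \hat{p}_{js}}{\sum_{j=1}^{M} p_{js} + \sum_{j=1}^{M} \hat{p}_{js}} \right) \tag{2} $$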


where $S$ refers to the total number of classes, which is equal to four (the
GD-enhancing tumor, peritumoral edematous/invaded tissue, necrotic tumor core,
and the background) in our task, $p_{js}$ is the ground truth of the $j$th
pixel for class $s$, and $\hat{p}_{js}$ is the corresponding predicted
probability. $w_s$ denotes the weighting coefficient of the $s$th class. $M$
refers to the total number of pixels of the input slice in a batch.

The total weighted loss function (TW) is formulated as 


$$ L_{TW}(P, \hat{P}) = \sum_{i=1}^{N} \left[ \theta L_{WCE}(P_i, \hat{P}_i) + (1 - \theta) L_{WDSC}(P_i, \hat{P}_i) \right] \tag{3} $$


In general, $L_{TW}$ is a weighted combination of the class-weighted
cross-entropy loss and the class-weighted dice loss. $P_i$ is the $i$th ground
truth of a batch of $N$ input images, and $\hat{P}_i$ is the $i$th predicted
mask of a batch of predictions. $\theta$ controls the contribution of $L_{WCE}$
and $L_{WDSC}$ to the total loss.
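
The following pure-Python sketch shows one common way to combine a class-weighted cross-entropy with a class-weighted dice loss for a single image. The normalization by the number of pixels, the clamping constant `eps`, the example weights, and the mixing value `theta=0.5` are illustrative assumptions rather than the settings used in the experiments.

```python
import math


def weighted_ce(p, p_hat, w, eps=1e-7):
    """Class-weighted binary cross-entropy summed over classes.

    p[s][j] is the 0/1 ground truth of pixel j for class s, p_hat[s][j] the
    predicted probability, and w[s] the weight of class s.
    """
    num_pixels = len(p[0])
    total = 0.0
    for s in range(len(p)):
        for j in range(num_pixels):
            q = min(max(p_hat[s][j], eps), 1 - eps)  # avoid log(0)
            total += w[s] * (p[s][j] * math.log(q)
                             + (1 - p[s][j]) * math.log(1 - q))
    return -total / num_pixels


def weighted_dice(p, p_hat, w, eps=1e-7):
    """Class-weighted soft dice loss (1 - dice) summed over classes."""
    loss = 0.0
    for s in range(len(p)):
        inter = sum(a * b for a, b in zip(p[s], p_hat[s]))
        denom = sum(p[s]) + sum(p_hat[s])
        loss += w[s] * (1 - (2 * inter + eps) / (denom + eps))
    return loss


def total_weighted_loss(p, p_hat, w, theta=0.5):
    """theta balances the pixel-level (CE) and region-level (dice) terms."""
    return theta * weighted_ce(p, p_hat, w) + (1 - theta) * weighted_dice(p, p_hat, w)


# Two classes, two pixels: a perfect and a completely wrong prediction.
truth = [[1, 0], [0, 1]]
wrong = [[0, 1], [1, 0]]
weights = [0.5, 0.5]
print(total_weighted_loss(truth, truth, weights))
print(total_weighted_loss(truth, wrong, weights))
```

In practice the class weights would be chosen from the class distribution so that rare tumor sub-regions contribute more to the loss than the background.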


3 Experiments 


In this section, we introduce the dataset used in the experiments in Sect. 3.1
and explain the evaluation metrics in Sect. 3.2. We then describe the
preprocessing and postprocessing methods in Sects. 3.3 and 3.4, respectively.
Finally, the implementation details are specified in Sect. 3.5.


3.1 Dataset 


The BraTS 2021 dataset contains a total of 2000 cases separated into a
training set, a validation set, and a test set. Each case has multi-modal MRI
(T1, T1ce, T2, Flair), as shown in Fig. 3. All the imaging datasets have been
segmented manually by one to four expert raters. The training set includes MR
cases and the corresponding annotations: GD-enhancing tumor (ET, label 4),
peritumoral edematous/invaded tissue (ED, label 2), and necrotic tumor core
(NCR, label 1). The validation set contains 219 cases, and their annotations
are not provided to the participants. The test set is not publicly available to
the participants.


Fig. 3. An example multi-modal MRI case of the BraTS 2021 dataset.
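
The results later in the paper are reported for the whole tumor (WT), tumor core (TC) and enhancing tumor (ET) regions rather than for the raw labels. The sketch below shows how these regions are usually derived from the label values 1, 2 and 4; the region composition follows the common BraTS convention and is an assumption here, since it is not spelled out in this section. The names `REGIONS` and `region_mask` are hypothetical.

```python
# Assumed BraTS convention: WT = NCR + ED + ET, TC = NCR + ET, ET = ET only.
REGIONS = {
    "WT": {1, 2, 4},
    "TC": {1, 4},
    "ET": {4},
}


def region_mask(label_map, region):
    """Binary mask (nested lists of 0/1) for one evaluation region."""
    keep = REGIONS[region]
    return [[1 if value in keep else 0 for value in row] for row in label_map]


# Tiny 2 x 2 label map: background, NCR, ED, ET.
labels = [[0, 1],
          [2, 4]]
for name in ("WT", "TC", "ET"):
    print(name, region_mask(labels, name))
```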


3.2 Metrics 


In our experimental results, we use DSC and Hausdorff_95 (95%HD) to assess
the prediction performance. The DSC measures the similarity between the ground
truth segmentation mask and the predicted segmentation mask, i.e. the spatial
overlap between the predicted brain tumor segmentation and the label. The
difference between the Hausdorff distance (HD) and the Dice coefficient is that
the Dice coefficient is sensitive to the interior filling of the segmentation,
while the Hausdorff distance is sensitive to the segmented boundary. BraTS uses
Hausdorff_95, i.e. the 95th percentile of the boundary distances rather than
their maximum, in order to eliminate the influence of outliers in small sets.
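
A minimal sketch of both metrics on sets of pixel coordinates is given below. For brevity it measures distances between all foreground pixels instead of extracted boundary pixels and uses a simple linear-interpolation percentile; both choices are simplifying assumptions and may differ from the official BraTS evaluation code. Both masks must be non-empty for `hd95`.

```python
import math


def dice(a, b):
    """Dice coefficient of two sets of foreground pixel coordinates."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def hd95(a, b):
    """95th-percentile symmetric Hausdorff distance between two point sets."""
    d_ab = [min(math.dist(p, q) for q in b) for p in a]
    d_ba = [min(math.dist(q, p) for p in a) for q in b]
    return max(percentile(d_ab, 95), percentile(d_ba, 95))


truth = {(0, 0), (0, 1), (1, 0), (1, 1)}
pred = {(0, 1), (1, 1), (0, 2), (1, 2)}
print(dice(truth, pred), hd95(truth, pred))
```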

3.3 Preprocessing


The BraTS 2021 dataset is provided as NIfTI files. We sliced the data for
network training according to the ground truth masks and normalized the slices
into [0, 1] using the Z-score standardization method. Data augmentation,
including random rotation, random flip, and random crop, is utilized in the
training process.
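
One possible reading of this normalization step is a Z-score standardization followed by a rescaling to [0, 1]. The sketch below implements that reading on a flat list of intensities; the two-step order and the helper names are assumptions, since the text does not give the exact procedure.

```python
import statistics


def zscore(values):
    """Standardize intensities to zero mean and unit (population) variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def rescale_01(values):
    """Linearly map values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


intensities = [10.0, 20.0, 30.0, 100.0]
print(rescale_01(zscore(intensities)))
```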


3.4 Postprocessing 


The whole predicted mask for a raw MR scan is obtained by combining all slice
segmentation masks. Then, morphological operations are used to refine the
segmentation masks.


3.5 Implementation Details 


Our experiments are conducted in PyTorch [15]. In training, the number of
epochs is set to 20 and the batch size is set to 4. The models are trained with
the Adam optimizer and standard back-propagation [8] with a fixed learning rate
of 1e-4. Our experiments are run on an NVIDIA GeForce RTX 2080Ti with 11G of
GPU memory.


4 Result 


4.1 Results on the Training Set of BraTS 2021 


We conducted ablation experiments to investigate the advantages of our model.
The performance of our method was evaluated through 5-fold cross validation
on the training dataset. In this experiment, MSFR-Net is compared with three
different methods:


- 2D U-Net: Basic U-Net with multi-modal MRI as input.

- 2D U-Net+M: The 2D U-Net using the proposed MSFE block as shown in Fig. 2.

- 2D U-Net+C: The MSFE block of 2D U-Net+M is replaced with CBAM.


Table 1 shows the quantitative performance. MSFR-Net achieved DSC results of
56.87%, 72.49%, and 69.55% in NCR, ED, and ET, respectively. In addition, the
visualization results of the segmented brain tumors are shown in Fig. 4.

Table 1. Dice score (mean) of the proposed method on 5-fold cross validation.


Method      | Necrotic tumor core | Peritumoral edematous | Enhancing tumor
2D U-Net    | 0.5213              | 0.6923                | 0.6847
2D U-Net+M  | 0.5366              | 0.7175                | 0.6624
2D U-Net+C  | 0.5554              | 0.7219                | 0.6940
2D MSFR-Net | 0.5687              | 0.7249                | 0.6955


4.2 Results on the Validation Set of BraTS 2021 


Additionally, we evaluate the 2D MSFR-Net results on the BraTS 2021 validation
set. The results are listed in Table 2. For the WT, TC, and ET, our method
obtained mean DSC values of 90.18%, 81.61%, and 76.89%, respectively. The
corresponding HD95 results are 6.1562, 16.6548, and 30.2116, respectively.


Table 2. Quantitative results on BraTS 2021 Validation set. 


Metrics    | ET (DSC) | TC (DSC) | WT (DSC) | ET (HD) | TC (HD) | WT (HD)
Mean       | 0.7689   | 0.8161   | 0.9018   | 30.2116 | 16.6548 | 6.1562
Std        | 0.2741   | 0.2488   | 0.0911   | 93.5059 | 60.7865 | 9.6682
Median     | 0.8745   | 0.9228   | 0.9244   | 2       | 3       | 3
25quantile | 0.7724   | 0.8168   | 0.8872   | 1       | 1.4142  | 1.7321
75quantile | 0.9212   | 0.9535   | 0.9513   | 4.0923  | 7.3581  | 5.7878

4.3 Results on the Test Set of BraTS 2021 


Finally, we evaluate the 2D MSFR-Net results on the BraTS 2021 test set. The
results are listed in Table 3. For the WT, TC, and ET, our method obtained
mean DSC values of 89.15%, 83.02%, and 82.08%, respectively. The corresponding
HD95 results are 7.2793, 21.7760, and 17.0458, respectively.


Table 3. Quantitative results on BraTS 2021 Test set. 


Metrics    | ET (DSC) | WT (DSC) | TC (DSC) | ET (HD) | WT (HD) | TC (HD)
Mean       | 0.8208   | 0.8915   | 0.8302   | 17.0459 | 7.2793  | 21.7760
Std        | 0.2236   | 0.1405   | 0.2643   | 68.5424 | 13.2216 | 74.4397
Median     | 0.8912   | 0.9281   | 0.9377   | 1.4142  | 3       | 2.2361
25quantile | 0.8133   | 0.8912   | 0.8595   | 1       | 1.7321  | 1.4142
75quantile | 0.9393   | 0.9564   | 0.9661   | 2.8284  | 6       | 6.8128

Fig. 4. The visualization results of segmented brain tumors. Differently colored
areas represent different tumor regions: green for WT, red for TC, and yellow
for ET.


5 Discussion and Conclusion 


In this paper, we propose a novel network structure for brain tumor
segmentation based on the multi-scale feature extraction and recalibration
(MSFER) module. By learning the context information of multi-scale feature maps
and recalibrating them, it can accurately capture the complementary features of
different feature maps. We performed ablation experiments on the BraTS 2021
training set and evaluated MSFR-Net on the validation set.

The major advantage of MSFR-Net is the use of multi-scale feature
recalibration. In the cross validation results shown in Table 1, the DSC values
of 2D U-Net in NCR and ED are 52.13% and 69.23%, respectively, while the results
with the MSFE block are 53.66% and 71.75%. This comparative experiment
demonstrated the effectiveness of the MSFE block, which can obtain multi-scale
features that provide complementary context information. Similarly, by
introducing CBAM into U-Net, the DSC results are 55.54% and 72.19%. In
particular, the score of the U-Net+C network on ET also increased from 68.47% to
69.40%, which shows that the module can effectively focus on the region of
interest and recalibrate the features. Finally, we tested MSFR-Net and obtained
the best DSC values of 56.87%, 72.49%, and 69.55% in NCR, ED, and ET
segmentation, respectively. These results demonstrated the superiority of
multi-scale feature recalibration. Despite these advantages, MSFR-Net cannot
explicitly model global features, which limits its segmentation performance.
Inspired by recent approaches that leverage transformers [18] to explicitly
learn global information [5,17], we will focus on integrating transformers with
convolution layers to improve the segmentation framework in the future.

Abstract. In this paper we present a small and fast Convolutional Neural
Network (CNN) used to predict the presence of MGMT promoter
methylation in Magnetic Resonance Imaging (MRI) scans. Our data set
is “The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor
Segmentation and Radiogenomic Classification” by U. Baid, et al. We
focus on using the median (“middle-most”) cross section of a FLAIR
scan as the input to the neural net for training. This cross section
presents the most, or nearly the most, surface area compared to any
other cross section. We are thus able to reduce the computational
complexity and time of the training step while preserving the
high-performance extrapolation capabilities of the model on unseen data.


Keywords: MRI scans · Convolutional Neural Network ·
Glioblastoma · MGMT promoter methylation


1 Background 


Malignant brain tumors, such as glioblastoma, are a life-threatening condition
with median survival rates of less than one year. However, the presence of
O6-methylguanine-DNA methyltransferase (MGMT) promoter methylation in the
tumor can be a favorable prognostic factor for glioblastoma [1]. Analysing the
brain for the presence of tumors or of tumor indicators, such as MGMT promoter
methylation, often involves the surgical extraction of brain tissue samples.
Following this procedure, the timeline for receiving the results of the genetic
characterization of the tumor can be up to several weeks. Thus there are many
incentives for the development of non-invasive solutions, such as imaging
techniques, which would ultimately lead to less invasive diagnoses and
treatments for brain cancer patients and to better survival prospects [1].

Magnetic resonance imaging (MRI) scans are an effective non-invasive
method for detecting glioblastoma through the detection of MGMT promoter
methylation [1]. MRI also allows monitoring the status of a tumor in real time
[4]. FLAIR (fluid-attenuated inversion recovery) imaging is a newer and
seemingly more sensitive MRI sequence, with its images obtained with an
inversion recovery sequence characterized by a long inversion time (TI) and a
long echo time (TE) [7]. As discussed by Khademi et al. [6], FLAIR is effective
for localizing pathology, achieved partly by intensifying the darkness of the
cerebrospinal fluid (CSF) in contrast with white and grey matter. Khademi et
al. [6] in particular discuss how FLAIR is favorable for detecting white matter
lesions. One issue with training machine learning (ML) algorithms on images is
the noise introduced into the data by the misinterpretation of imaging
artifacts, which are generally identifiable by humans but currently a challenge
for algorithms [6]. In this paper, the FLAIR dataset is chosen to train the ML
algorithm for detecting MGMT promoter methylation due to the precedent seen in
the papers above that FLAIR is a richer data format than the others, leading to
more effective models.

Other imaging methods have been studied, such as that of Chen et al. [4] 
in which a data set of T1-weighted images (CE-T1W1) was used to train the 
algorithm. Compared to FLAIR, CE-T1W1 had a lower Dice score [4]. 

Overall, ML algorithms have the potential to improve the detection of
MGMT promoter methylation and thus improve patient survival rates. One
reason is that an algorithm can be applied to large quantities of images that
would take a human much longer to process. The key aspect to be cognizant of is
whether the ML algorithm maintains the detection accuracy of human judgement
when searching for indicators of MGMT promoter methylation in the FLAIR images.


2 Dataset 


In this section we describe our dataset fully and how we normalize it for
ingestion by the net. The original dataset included 585 training samples
labelled 0 or 1, indicating the absence or presence of MGMT promoter
methylation, respectively. The dataset also includes 87 test samples with no
labels. Training samples 00109, 00123, and 00709 all had issues with the FLAIR
scan data, so we ignore them. This leaves us with 585 - 3 = 582 training
samples to work with. Each sample has four types of scans associated with it,
and we choose to use only the FLAIR data, which is a series of cross-sectional
scans [1,2]. For each sample we then choose the median cross section of the
FLAIR scan (see Sect. 2.1 below on how this is done). Each cross section is a
DICOM, which we convert to a PNG and then resize to a standard size of
224 x 224 pixels, since the sizing is not consistent between samples. We
convert to an RGB PNG because a DICOM is not a numerical matrix form of data,
and would be unusable with a CNN.

2.1 Selection of the Median Cross Section


Within this data set, the median (or “middle-most”) cross sections of the scans
are selected, as these have the most area compared to the other cross sections.
This allows the algorithm to learn more from an individual scan and reduces
the computational resource usage while preserving the extrapolation
capabilities of the model on unseen data. To select the desired cross section,
we sort the files by name using a natural sort (also known as a human sort),
since the files are labelled Image-1.dcm, Image-2.dcm, etc., where the number
represents a well ordering of the images from front to back when forming a
complete 3D view, and the files are not initially sorted [1,2]. We then select
the median cross section by picking the median index of the sorted list of file
names.
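
The selection step can be sketched with the standard library alone. The helper names below are hypothetical, and taking index `len // 2` for lists of even length is an assumption, since the text does not say which of the two middle files is used in that case.

```python
import re


def natural_key(name):
    """Sort key that compares embedded numbers numerically.

    "Image-10.dcm" is split into ["image-", 10, ".dcm"], so it sorts after
    "Image-2.dcm" instead of before it.
    """
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def median_cross_section(file_names):
    """Return the middle file name after natural sorting."""
    ordered = sorted(file_names, key=natural_key)
    return ordered[len(ordered) // 2]


files = ["Image-10.dcm", "Image-2.dcm", "Image-1.dcm",
         "Image-3.dcm", "Image-11.dcm"]
print(median_cross_section(files))
```

A plain lexicographic sort would place Image-10.dcm and Image-11.dcm before Image-2.dcm, and the file picked as the median would no longer be the spatially middle slice.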

Since the image scans are in DICOM format, we convert them to PNG format. 

We end up with training data like the training batch below. Here 0 indicates 
that the patient does not have the MGMT promoter methylation, while 1 indi- 
cates that the patient does have the MGMT promoter methylation, as labelled 
in the training data set: 


Fig. 1. Batch of 9 FLAIR median cross section scans. 

We can see that our choice of the median cross section results in
choosing a cross section with nearly maximal surface area. Contrast that with
the following evolution in canonical geometrical ordering as presented by Johnson
[5]:




Fig. 2. The geometric evolution of one patient’s FLAIR MRI [5]. 


3 Design 


A seven-layer deep Convolutional Neural Network (CNN) is used to predict the 
presence of glioblastoma in the MRI scans. The training data set is based on 
Baid et al. [2,9-12]. 




3.1 Net Architecture 


Below we present the string serialization of our CNN, which is a modification of
the architecture presented by PyTorch [8]. The padding for convolutions is valid
padding:


Net(
  (conv1): Conv2d(3, 6, kernel_size=(5, 5), stride=(1, 1))
  (pool1): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (pool2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (fc1): Linear(in_features=400, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)


Fig. 3. String serialization of the 7 layer CNN, two convolutional layers, two max 
pooling layers, and three linear (fully-connected) layers. 
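
The link between the convolutional part and the first linear layer can be checked with the usual output-size formula for valid padding, floor((n - k) / s) + 1. The sketch below applies it to the layer parameters printed above; the input sizes tried are illustrative, and how the 224 x 224 images relate to the 400 input features of fc1 is not stated in the text, so no claim about that is made here.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(size, out_channels=16):
    """Number of values entering fc1 for a square input of the given side."""
    size = conv_out(size, kernel=5)            # conv1, valid padding
    size = conv_out(size, kernel=2, stride=2)  # pool1
    size = conv_out(size, kernel=5)            # conv2, valid padding
    size = conv_out(size, kernel=2, stride=2)  # pool2
    return out_channels * size * size


for side in (32, 224):
    print(side, flattened_features(side))
```

With a 32 x 32 input this arithmetic gives 16 x 5 x 5 = 400 values, which is the in_features value printed for fc1.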


3.2 Activation Function 


We apply a Rectified Linear Unit (ReLU) activation function to each layer, 
including the output layer. 


3.3 Number of Outputs 


Note that the neural net has 10 output nodes even though this is a
categorical classification problem with 2 labels. We explain this in Sect. 4.3.
To better visualize this in action, we present a net graph:


Fig. 4. Neural network architecture used in the paper. Diagram labels: Input
(228 x 228 x 3 channels); Convolution 1 (5 x 5 kernel, valid padding,
224 x 224 x 6); Max Pooling (2 x 2, 223 x 223 x 6); Convolution 2 (5 x 5
kernel, valid padding, 219 x 219 x 16); Max Pooling (2 x 2, 218 x 218 x 16);
fully connected layer (120).

4 Training


4.1 Software 


We used the FastAI v2 package for developing the model, as well as PyTorch, 
python3, and pandas 1.0 [14-17]. The code used to generate all results can be 
found on Kaggle [18]. 


4.2 Hardware 


We trained our neural net in a python Jupyter notebook in the Kaggle environ- 
ment. The notebook’s backend comes with a CPU and NVIDIA TESLA P100 
GPU [13]. We trained the model with GPU acceleration. 


4.3 Loss Function 


Our loss function is the Categorical Cross Entropy Loss function,


$$ \text{Loss} = -\sum_{i=1}^{10} y_i \log \hat{y}_i , $$

since the output vector of our net is of length 10. Note that, because we only
have two ground truth labels, 0 and 1, we convert them to vectors of length 10
like so:

0 -> [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
1 -> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

Therefore positions 3 through 10 of each label vector are always 0.

Position 2 of the output vector of the net therefore indicates the prediction
of the presence of methylation in the patient scan.

The authors note that we should have used only 2 output nodes instead of
10, to match the number of labels, as this is a categorization problem.
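
The label encoding and the loss can be illustrated in a few lines of plain Python. The softmax step is an assumption about how the ten raw outputs are turned into probabilities, and the helper names are hypothetical; this is a sketch of the idea, not the training code.

```python
import math

NUM_OUTPUTS = 10


def one_hot(label, size=NUM_OUTPUTS):
    """Length-10 target vector; label 1 sets position 2 (index 1)."""
    vec = [0.0] * size
    vec[label] = 1.0
    return vec


def softmax(logits):
    """Numerically stable softmax over a list of raw outputs."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(target, probs, eps=1e-12):
    """Negative log-likelihood of the target class."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


probs = softmax([0.2, 1.5] + [0.0] * 8)  # hypothetical raw outputs
print(one_hot(1))
print(categorical_cross_entropy(one_hot(1), probs))
print("predicted methylation probability:", probs[1])
```

Because positions 3 through 10 never appear as targets, the loss pushes their probabilities towards zero, and only the first two outputs carry label information.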


4.4 Optimizer 


Our optimizer is the Adam Optimizer, with β1 = 0.9, β2 = 0.99 and ε = 1e-5.
The learning rate used by the optimizer, i.e. α, is determined in the Learning
Rate section below.

4.5 Batch Size


Our batch size is 64. 


4.6 Cross Validation 


We hold out a random 20% of the training set as a validation set while training, 
unseeded. Given that there are originally 582 training samples to work with in 
the raw dataset, the validation set therefore contains 582 * 0.20 = 116 samples. 
This leaves 582 - 116 = 466 training samples to work with. 
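
The split sizes above can be reproduced with a short sketch. The function name and the optional seed argument are hypothetical; passing `seed=None` corresponds to the unseeded split described in the text, and truncating the validation size to an integer is an assumption that matches the reported 116 samples.

```python
import random


def holdout_split(items, valid_fraction=0.2, seed=None):
    """Randomly hold out a fraction of the items as a validation set."""
    items = list(items)
    rng = random.Random(seed)  # seed=None gives a different split each run
    rng.shuffle(items)
    n_valid = int(len(items) * valid_fraction)  # 582 * 0.20 = 116.4 -> 116
    return items[n_valid:], items[:n_valid]


train_ids, valid_ids = holdout_split(range(582))
print(len(train_ids), len(valid_ids))
```

Since the split is unseeded, the validation metrics of two runs are computed on different samples and are not directly comparable.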


4.7 Learning Rate 


We determine the ideal learning rate below. Here we plot the loss against the
chosen learning rate. We employ the LR Range test by Smith [3] to determine
the rate. We start with a learning rate of 1e-07 and end with a learning rate of
10. We iterate 100 times and stop when the loss diverges:


SuggestedLRs(lr_min=0.002290867641568184, lr_steep=6.309573450380412e-07)


Fig. 5. Learning rate vs Loss (x-axis: Learning Rate on a logarithmic scale
from 10^-7 to 10^0; y-axis: Loss, from 10 to 20).
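
The learning rates visited by such a range test can be sketched as a geometric progression between the two end points. The exponential spacing and the function name are assumptions for illustration; the training loop and the divergence check are omitted.

```python
def lr_range_schedule(start=1e-7, end=10.0, num_iter=100):
    """Exponentially spaced learning rates from start to end (inclusive)."""
    ratio = end / start
    return [start * ratio ** (i / (num_iter - 1)) for i in range(num_iter)]


schedule = lr_range_schedule()
print(schedule[0], schedule[49], schedule[-1])
```

In a range test, one mini-batch is trained at each of these rates and the loss is recorded, which yields a curve like the one in Fig. 5.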


4.8 Training Policy 


We then train the model by using the One Cycle Policy again by Leslie Smith 
[3]. We train for 10 epochs and use a max learning rate of 1e-02. 
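
A one-cycle schedule first warms the learning rate up to its maximum and then anneals it to a very small value. The sketch below uses cosine interpolation for both phases; the warm-up fraction `pct_start`, the divisors `div` and `div_final`, and the total step count in the example are illustrative assumptions, not values reported in this paper.

```python
import math


def cosine_anneal(start, end, pct):
    """Cosine interpolation from start (pct=0) to end (pct=1)."""
    return end + (start - end) / 2 * (1 + math.cos(math.pi * pct))


def one_cycle_lr(step, total_steps, lr_max=1e-2, pct_start=0.25,
                 div=25.0, div_final=1e5):
    """Learning rate at a given step of a one-cycle schedule."""
    warm_steps = pct_start * total_steps
    if step < warm_steps:
        # Warm-up phase: lr_max / div -> lr_max
        return cosine_anneal(lr_max / div, lr_max, step / warm_steps)
    # Annealing phase: lr_max -> lr_max / div_final
    pct = (step - warm_steps) / (total_steps - warm_steps)
    return cosine_anneal(lr_max, lr_max / div_final, pct)


total = 80  # hypothetical: 10 epochs of 8 mini-batches each
print([round(one_cycle_lr(s, total), 6) for s in (0, 10, 20, 50, 80)])
```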

epoch  train_loss  valid_loss  error_rate  accuracy  time
0      1.179571    0.674959    0.413793    0.586207  00:02
1      1.108988    0.729703    0.586207    0.413793  00:02
2      1.076067    0.665072    0.396552    0.603448  00:02
3      1.050501    0.963969    0.586207    0.413793  00:03
4      1.027446    0.749877    0.508621    0.491379  00:02
5      1.000942    0.680456    0.405172    0.594828  00:02
6      1.002666    0.681733    0.413793    0.586207  00:02
7      0.978674    0.697218    0.474138    0.525862  00:02
8      0.965975    0.710260    0.508621    0.491379  00:02
9      0.951226    0.704067    0.517241    0.482759  00:02


Fig. 6. Training results after 10 epochs. The error_rate and accuracy are w.r.t the 
validation set and not the training set. 



Fig. 7. Same as previous figure, but we have highlighted some details. 


In Fig. 7, we highlight some trends and features of Fig. 6 that are of
interest to us. In red, we show that the model was able to learn, since the
validation loss decreases after the 3rd epoch, as shown by the downward grey
arrow. In blue, we show that the validation error was minimized at the 5th
epoch, so it may have been better to stop training at the 5th epoch. We would
also like to remark that this net trained very fast, with each epoch taking
only 2 to 3 s.
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5 Results and Discussion 


At this point, we have achieved an area of 0.61680 under the ROC 
curve between the predicted probability and the observed target when testing 
on the Public Test Set specified for the “RSNA-MICCAI Brain Tumor 
Radiogenomic Classification” Kaggle competition [1]. This placed our result in 
the top 25% of competitors on the public leaderboard in August 2021. When the 
private leaderboard was revealed at the end of the competition, we found that 
the model had achieved an area of 0.53363 under the ROC curve on the Private 
Test Set. We needed 170.6 s of total runtime in the Kaggle notebook environment 
for the model to load, preprocess, train, and then predict on the dataset. 
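
As a reminder of what the reported metric measures, the area under the ROC curve can be computed directly as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below is illustrative only (quadratic in the number of samples) and is not the competition's scoring code.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking, which puts the public (0.61680) and private (0.53363) scores in context.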


5.1 Choice of Net Architecture over Others 


We are well aware of the existence of more popular architectures such as ResNet 
and VGG. Our choice of architecture came about due to the constraints of the 
competition. Unfortunately, using a pretrained model and then 
applying transfer learning was not possible under the competition 
rules. We also had a constraint of around 10 h of total running time for a 
submitted model on the Kaggle platform, so training a ResNet or VGG from 
scratch would not have been feasible. Therefore, it became clear to us that we 
needed to use a simple network architecture. 


5.2 Results of Other Methods 


We attempted a few other methods that performed much worse on various 
dimensions. One method we tried was a feature-engineered solution that involved 
counting the number of dark points or light points, or ratios thereof, and then 
classifying based on the feature count. Since we converted all our images to PNGs 
on a 0-255 grayscale, we could classify a point as “dark” if its intensity 
was less than 100, for example, and “bright” if its intensity was greater than 
160 or so. Such thresholds were chosen by qualitatively comparing the PNGs and 
the corresponding intensity values. We noticed when manually examining the PNG 
scans that, for patients that had methylation, the methylated area would appear 
as a circle or ring in the brain cross section. Either the brain would appear 
dark and the ring would be bright, or the brain would appear bright and the ring 
would appear dark. See Fig. 8, which presents a case of methylation: 


Fig. 8. This patient has methylation and shows a bright white aura against a dark 
brain. 


In light of this, we can see that the notion of light and dark points 
becomes useful. Note that we also treat the entire space outside of the brain 
in the cross-section as dead-space points, since they are completely black and 
would throw off any ratio calculations if we counted them as “dark” points. So 
we exclude all of these dead-space points from being counted as “dark” points. We 
then tried to see if there was any clear rule that would work as a classifier, 
such as: out of all non-dead-space points, if less than 20% of points are “dark” 
in a mostly “bright” brain, or vice versa, then there must be methylation in the 
brain. The fundamental intuition behind this rule is to match what we were 
seeing in the actual methylated cross-sections: a dark ring on a bright 
brain, or a bright ring on a dark brain. Unfortunately, none of the 
attempted rules worked. The method performed very poorly, with less than 0.45 
area under the ROC curve on the public test set. With more 
exploratory data analysis, the initial intuition no longer applied and there were 
clear counterexamples to our rules. In Fig. 9 we present a counterexample: 
a patient with no methylation who nevertheless presents a large white circle on 
a dark brain scan. See Fig. 1 for more counterexamples and false positives. 
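
One way such a rule could be written down is sketched below. The thresholds 100, 160 and 20% are the example values from the text; treating dead space as pixels of intensity exactly 0 and "mostly" as more than half of the remaining pixels are assumptions made for this illustration.

```python
import numpy as np

DARK_MAX = 100     # intensity below this counts as "dark" (example value from the text)
BRIGHT_MIN = 160   # intensity above this counts as "bright" (example value from the text)

def point_fractions(img):
    """Fractions of dark and bright points among non-dead-space pixels."""
    img = np.asarray(img)
    tissue = img[img > 0]              # assumption: dead space is exactly 0
    if tissue.size == 0:
        return 0.0, 0.0
    dark = np.count_nonzero(tissue < DARK_MAX) / tissue.size
    bright = np.count_nonzero(tissue > BRIGHT_MIN) / tissue.size
    return dark, bright

def ring_rule(img, minority_max=0.20, majority_min=0.5):
    """Flag methylation when a small minority class sits in a brain dominated by the other."""
    dark, bright = point_fractions(img)
    dark_ring = 0 < dark < minority_max and bright > majority_min
    bright_ring = 0 < bright < minority_max and dark > majority_min
    return bool(dark_ring or bright_ring)
```

Because the rule looks only at global intensity ratios, any scan with a small bright or dark region triggers it, which is consistent with the counterexamples reported above.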

Another method we tried was using 20-layer deep nets with 20+ cross sections 
as input, sampled at a regular interval. Thus, we were exposing more data 
to a larger neural net. In this sense, we took the method described 
in this paper, added more input data and more layers to the net, 
and trained for many more hours. Unfortunately, this method performed very 
poorly, with less than 0.45 area under the ROC curve on the public test set. 
Therefore, it seems that training on fewer features is better. Our rationale for 
why this happened is as follows. There are only around 500 unique patient scans, 
each with 500+ cross sections. So there are many features/dimensions 
per training sample but not enough training samples. The data 
is therefore very high dimensional, making it hard for a net to learn 
anything from it. Sampling the middle-most cross section has 
a regularization/dimensionality-reduction effect similar to that of Dropout 
layers. 
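
The two sampling strategies compared here can be sketched as index selections over the slices of one scan. The function names are hypothetical, and the exact interval used for the 20+ slice variant is not stated in the text, so even spacing is assumed.

```python
def middle_slice_index(num_slices):
    """Index of the middle-most cross section of a scan."""
    return num_slices // 2

def regular_sample_indices(num_slices, count):
    """`count` slice indices spread evenly across the scan (assumed spacing)."""
    step = num_slices / count
    return [int(step * (i + 0.5)) for i in range(count)]
```

For a scan with 500 cross sections, the first function selects a single slice while the second returns 20 slices, which multiplies the input dimensionality per patient accordingly.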


Fig. 9. This patient has no methylation and also shows a bright white circle against a 
dark brain. 


In the end, both performed worse (less than 0.45 area under the ROC curve on 
the public test set; not scored on the private test set). It should also be 
noted that the first alternative method was extremely fast (around 2 min to run 
the entire method and predict on the test set), while the second method was 
extremely slow (around 9 h to run end-to-end). 


6 Conclusion and Future Work 


In summary, by focusing the training on the median cross sections of the FLAIR 
scans in the data set, the computational complexity is reduced. At the same 
time, the ability of the algorithm to extrapolate from this data to predict the 
presence of MGMT promoter methylation and glioblastoma in the FLAIR scans 
is preserved, thus leading to improved efficiency at this stage. 

An interesting application of this design is in an embedded system or otherwise 
resource-constrained machine performing online learning on real-time 
data that the system scans and then trains the net with. Our design would 
provide excellent extrapolative capabilities in the system, while still being 
feasible because it uses very little compute power relative to more 
data-intensive methods. 

For future work it would be prudent to assess the training effectiveness when 
including other imaging types aside from FLAIR, such as T1w, T1Gd, and T2 
scans, which are available in the original data set. We anticipate that 
performance would improve due to having more data. 
Abstract. Gliomas are the most common primary malignant brain 
tumors. Accurate segmentation and quantitative analysis of brain tumors 
are critical for diagnosis and treatment planning. Automatically segmenting 
tumors and their subregions is a challenging task, as demonstrated by 
the annual Multimodal Brain Tumor Segmentation Challenge (BraTS). 
To tackle this task, we trained a 2D non-local Mask 
R-CNN with 834 patients from the BraTS 2021 training dataset. Our performance 
on another 417 patients from the BraTS 2021 training dataset 
was as follows: DSC of 0.784, 0.851 and 0.817; sensitivity of 0.775, 0.844 
and 0.825 for the enhancing tumor, whole tumor and tumor core, respectively. 
By applying the focal loss function, our method achieved a DSC 
of 0.775, 0.885 and 0.829, as well as sensitivity of 0.757, 0.877 and 0.801. 
We also experimented with data distillation to ensemble a single model’s 
predictions. Our refined results were DSC of 0.797, 0.884 and 0.833; 
sensitivity of 0.820, 0.855 and 0.820. 


Keywords: Glioma segmentation · Non-local Mask R-CNN 


1 Introduction 


The incidence rate of primary brain tumors is 11-12 per 100,000 people. 
Gliomas are the most common brain tumors, accounting for about 50% of 
diagnosed brain tumors, and 26% of them are considered to be astrocytic tumors 
[1]. Glioblastoma (GBM) accounts for 50-60% of all gliomas and has the 
highest malignancy among gliomas. Gliomas exhibit different degrees of 
aggressiveness and variable prognosis, and contain various heterogeneous 
histologic sub-regions [1]. The inherent heterogeneity of gliomas is reflected 
in their radiographic morphologies [2], with different intensity profiles 
disseminated across multi-parametric magnetic resonance imaging (mpMRI) scans, 
depicting different sub-regions and differences in tumor biological properties 
[3]. Conventionally used sequences include the T1-weighted sequence (T1), the 
T1-weighted contrast-enhanced sequence using gadolinium contrast agents (T1Gd), 
the T2-weighted sequence (T2), and the fluid attenuated inversion recovery 
(FLAIR) sequence. Sub-regions of glioma can be defined from mpMRI: the enhancing 
tumor is typically hyper-intense in T1Gd when compared to T1; the non-enhancing 
as well as the necrotic tumor core are both hypo-intense in T1Gd 
when compared to T1; and the peritumoral edema is reflected by a hyper-intense 
signal in FLAIR. The sub-regions of glioma consist of three classes: Whole Tumor 
(WT), Tumor Core (TC), and Enhancing Tumor (ET). An example of each sequence 
and the tumor sub-regions is provided in Fig. 1. 


Fig. 1. Manual segmentation of brain tumor sub-regions (Red: WT; Green: ET; Yellow: 
NCT/NET) overlaid with different mpMRI modalities. The columns in order: T1, 
T1Gd, T2, FLAIR. (Color figure online) 


Segmentation of brain tumors in multimodal MRI images is one of the most 
difficult challenges in medical image analysis because of their highly varied 
appearance and shape. Annotations of sub-regions of brain tumors are 
traditionally performed manually by radiologists; however, manual segmentation 
is time-consuming, subjective, and difficult to reproduce 
[4,8]. Accurate delineation of each tumor sub-region is critical to a patient’s 
disease management and provides radiologists and neuro-oncologists with 
preoperative knowledge for appropriate therapeutic treatment guidance. There is 
growing interest in computational algorithms that address this task 
automatically. The Brain Tumor Segmentation (BraTS) challenge [1,9-12] was 
launched and has now grown into a well-established competition that allows 
competitors to develop and evaluate their methods by providing a 
large dataset with accompanying delineations of the relevant tumor sub-regions. 
The sub-regions considered for evaluation are the “enhancing tumor” (ET); the 
“tumor core” (TC), which entails the ET as well as the necrotic (NCR) parts of 
the tumor; and the “whole tumor” (WT), which entails the TC and the peritumoral 
edematous/invaded tissue (ED). 

In the past few years, many algorithms have been proposed to solve this problem. 
Compared with other methods, deep learning has shown state-of-the-art 
performance for segmentation tasks in general. In this paper, we used a 2D 
non-local Mask R-CNN to segment the sub-regions of glioma. We experimented 
with the focal loss function to address the class imbalance problem in the 
training set. We also applied data distillation to ensemble single-model 
predictions and refine the segmentation results. 


2 Methods 


2.1 Dataset 


The dataset provided in the BraTS 2021 training phase consists of 1251 
preoperative mpMRI scans of glioblastoma (GBM/HGG) and lower grade glioma 
(LGG). The mpMRI scans consist of T1, T1Gd, T2 and FLAIR, and were 
acquired with different clinical protocols and scanners from 19 institutions. All 
the imaging datasets have been segmented manually by one to four raters 
following the same annotation protocol, and their annotations were manually 
revised by expert board-certified neuroradiologists. The labels in the provided 
data are: 1 for NCR & NET, 2 for ED, 4 for ET, and 0 for everything else. The 
images were pre-processed with skull-stripping and co-registration to the same 
anatomical template and were resampled to the same resolution of 1 mm³ and a 3D 
volume of 240 × 240 × 155. The classes considered for object classification are 
WT, ET and NCR & NET. 
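
The relationship between the label values and the evaluated sub-regions can be made concrete with a short sketch. It follows the definitions given in the Introduction (TC contains ET and the NCR & NET label; WT additionally contains ED); the function name and dictionary layout are illustrative choices, not part of the BraTS tooling.

```python
import numpy as np

NCR_NET, ED, ET = 1, 2, 4   # label values as listed in the text; 0 is everything else

def region_masks(seg):
    """Binary masks for WT, TC and ET built from a BraTS-style label map."""
    seg = np.asarray(seg)
    et = seg == ET
    tc = et | (seg == NCR_NET)   # tumor core: ET plus the NCR & NET label
    wt = tc | (seg == ED)        # whole tumor: TC plus edema
    return {"WT": wt, "TC": tc, "ET": et}
```

The masks are nested by construction, so every ET voxel is also in TC and every TC voxel is also in WT.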

N4 bias field correction [13] was applied to the four mpMRI modalities to 
correct low-frequency intensity inhomogeneity. The FLAIR, T1, T1Gd and T2 images 
of each slice were normalized by subtracting the mean and dividing by the 
standard deviation. The mean and standard deviation were calculated excluding 
the image background. Brain patches were cropped from the images with a 1-pixel 
(px) margin around the brain contour. Contrast stretching and histogram 
equalization were applied to the patches. The FLAIR, T1, T1Gd and T2 patches 
from the same slice location made up a four-channel input, which was resized to 
a resolution of 256 × 256 × 4 px; the aspect ratio of the brain was preserved by 
padding with zeros. 
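
A minimal sketch of the normalization, cropping and padding steps for a single-channel slice is given below. It assumes that background pixels are exactly zero after skull-stripping and omits bias correction, contrast stretching, histogram equalization and the final resize; the function names are hypothetical.

```python
import numpy as np

def normalize_slice(img):
    """Z-score a slice using only non-background (non-zero) pixels."""
    img = np.asarray(img, dtype=np.float64)
    fg = img != 0                      # assumption: background is exactly 0
    if not fg.any():
        return img
    mean, std = img[fg].mean(), img[fg].std()
    out = np.zeros_like(img)
    out[fg] = (img[fg] - mean) / (std if std > 0 else 1.0)
    return out

def crop_brain(img, margin=1):
    """Crop to the bounding box of non-zero pixels plus a margin in pixels."""
    rows, cols = np.nonzero(img)
    if rows.size == 0:
        return img
    r0, r1 = max(rows.min() - margin, 0), min(rows.max() + margin + 1, img.shape[0])
    c0, c1 = max(cols.min() - margin, 0), min(cols.max() + margin + 1, img.shape[1])
    return img[r0:r1, c0:c1]

def pad_to_square(img):
    """Zero-pad the shorter side so a later resize keeps the aspect ratio."""
    h, w = img.shape
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    out = np.zeros((side, side), dtype=img.dtype)
    out[top:top + h, left:left + w] = img
    return out
```

Padding to a square before resizing is one simple way to keep the brain's aspect ratio when the target resolution is square.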


2.2 Non-local Mask R-CNN 


In this paper, we experimented with a region-based segmentation CNN [6] to 
investigate its performance in image segmentation, and we worked in a 2D 
setting for computational efficiency. 

The Mask R-CNN network [7] is an extension of Faster R-CNN [5] with an 
additional branch to predict an object’s mask. A Region Proposal Network (RPN) 
is used to pick out foreground regions and propose candidates with bounding 
boxes. Features of each candidate are extracted by a RoIAlign layer; then a 
segmentation CNN predicts the binary mask for each object, and a classification 
CNN predicts the class score of the masks and the bounding-box regression 
parameters, which are used to further refine the bounding boxes. In our 
experiment, ResNet-101 plus a Feature Pyramid Network was employed to extract 
features (F1-F4, Fig. 2) from the input images. Our method, a 2D non-local Mask 
R-CNN, is an extension of Mask R-CNN, as shown in Fig. 2(A). 


I' = I + f(\mathrm{softmax}(\theta(I^{T})\,\phi(I))\,g(I))    (1) 


A = f(\mathrm{softmax}(\theta(I^{T})\,\phi(I))\,g(I))    (2) 


The non-local network is used to capture long-range dependencies of the 
four-channel input I. The new input I' is modeled in Eq. 1, where f, θ, φ and g 
are embedding functions implemented as 1 × 1 convolutions. Softmax was 
applied along both dimensions i and j, which differs from [14], where softmax 
was applied along dimension j only. Thus, the model considers not only the 
relationships between the ith position and other positions but also the 
relationships between all other position pairs when synthesizing the ith 
position. To reduce the computation cost of the non-local network, the input I 
was downsampled to 128 × 128 × 4, and the output of the non-local network was 
resized to the original input size of 256 × 256 × 4. An attended input A, 
modeled in Eq. 2, was resized and concatenated to each layer of the feature 
pyramid to guide precise prediction. Figure 2(B) shows the architecture of the 
non-local network. 
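
To make the computation concrete, a non-local block on a flattened input can be sketched as below. This is an interpretation under stated assumptions rather than the implementation used in the paper: the 1 × 1 convolutions are written as per-position matrix products, "softmax along both dimensions" is read as one joint softmax over all position pairs, and all weight names are hypothetical.

```python
import numpy as np

def nonlocal_block(x, w_theta, w_phi, w_g, w_f):
    """Residual non-local block on a flattened input.

    x: (N, C) array, one row per spatial position.
    w_theta, w_phi, w_g: (C, D) weights standing in for 1x1 convolutions.
    w_f: (D, C) weights of the output embedding.
    Returns (x_new, attended), in the spirit of Eq. 1 and Eq. 2.
    """
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g   # (N, D) each
    logits = theta @ phi.T                            # (N, N) pairwise affinities
    weights = np.exp(logits - logits.max())           # shift for numerical stability
    weights /= weights.sum()                          # assumed: joint softmax over i and j
    attended = (weights @ g) @ w_f                    # (N, C)
    return x + attended, attended
```

The affinity matrix has N × N entries, which illustrates why downsampling the input before the non-local network reduces the cost substantially.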


Fig. 2. (A) Architecture of the non-local Mask R-CNN. (B) Architecture of the 
non-local network. 


2.3 Focal Loss 


The non-local Mask R-CNN network comes with two classifiers. The RPN classifies 
proposals into background and foreground, while the classification CNN 
further classifies the foreground into different objects. Though this two-stage 
framework has achieved top accuracy on a variety of tasks, class imbalance 
in the training data can be a central cause of misclassification of 
difficult examples. To tackle class imbalance, the focal loss function [15] adds 
a modulating term to the cross-entropy loss and focuses learning on hard 
negative examples. It is a dynamically scaled cross-entropy loss, in which the 
scaling factor decreases as confidence in the correct class grows. The focal 
loss function is given in Eq. 3. When γ = 0, the focal loss function reduces to 
the standard cross-entropy criterion. 


FL = -(1 - p_t)^{\gamma} \log(p_t)    (3) 
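

For a single binary prediction, the focal loss can be sketched as follows. The default γ = 2 is a commonly used value and an assumption here; the value used in the experiments is not stated in the text.

```python
import math

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss for predicted probability p of class 1 and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p      # probability assigned to the true class
    p_t = min(max(p_t, eps), 1.0)       # avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)
```

With γ = 0 the modulating factor equals 1 and the function returns the ordinary cross-entropy; with γ > 0, confidently correct predictions contribute much less than uncertain ones.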


2.4 Single Model Ensemble 


Data distillation ensembles the results of a single model run on the original 
unlabeled images as well as on different image transformations (flipping, 
resizing, rotation, etc.). Such transformations are commonly used as data 
augmentation options in training and have been shown to improve a single 
model’s accuracy. In [16], it was proposed to generate new training annotations 
and improve over the fully-supervised baselines. In this paper, we used data 
distillation to ensemble results from the single model and improve the accuracy. 
Our ensemble function is given by Eq. 4, where X is the input, T = (T_1, ..., 
T_K) is a set of transformation functions, T_k^{-1} is the inverse 
transformation of T_k, f_θ is the segmentation branch generating the object 
mask, and f_c is the classification branch generating the mask score. 


Y = f_\theta(X) + \sum_{k} T_k^{-1}\big(f_\theta(T_k(X))\big)    (4) 
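

For the up-down flip, the idea of combining a prediction on the original image with an inverse-transformed prediction on the transformed image can be sketched as below. `predict` is a hypothetical callable returning per-pixel mask probabilities, and averaging followed by a 0.5 threshold is an assumption made for this illustration.

```python
import numpy as np

def flip_ensemble(predict, image, threshold=0.5):
    """Combine mask probabilities from an image and its up-down flip.

    `predict` is a hypothetical callable mapping an (H, W, C) image to an
    (H, W) array of mask probabilities.
    """
    direct = predict(image)
    # Predict on the flipped image, then undo the flip on the output mask.
    flipped_back = np.flipud(predict(np.flipud(image)))
    mean_prob = (direct + flipped_back) / 2.0   # assumed combination rule
    return mean_prob >= threshold
```

Pixels that are predicted confidently in only one of the two views fall below the threshold, which is how such an ensemble can suppress isolated false positives.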


3 Results 


We randomly split the 1251 patients into 834 patients for training and 417 
patients for validation. Our development was built upon [17], and models were 
trained on an 8× A100 GPU server. The Adam optimizer was used with the learning 
rate initialized to 0.0001, β1 to 0.9 and β2 to 0.999. Augmentation options 
included flipping up-down (flipud), rotation (a random angle from -7° to 7°) and 
Gaussian blurring (random variance from 0.7 to 1.3). Each augmentation option 
was applied with a probability of 0.5. We sampled the model’s parameters every 
5000 steps for 150 epochs. Models were trained using a batch size of one. 
Training took about 3 days. The best model was selected by WT Dice similarity 
coefficient (DSC) (primary criterion) and TC DSC (secondary criterion) on the 
validation patients. DSC is given by Eq. 5; sensitivity, given by Eq. 6, is also 
used to evaluate the segmentation performance. TP denotes the true positives, FN 
the false negatives, and FP the false positives. 


DSC = \frac{2TP}{2TP + FN + FP}    (5) 

Sensitivity = \frac{TP}{TP + FN}    (6) 

We observed that each patient presented with one gross 
tumor; therefore, we generated our final predictions by keeping only the largest 
connected component in each volume to exclude possible false positives. The 
focal loss function was used in both the RPN and the classification CNN. To 
simplify the prediction process, we applied only the flipud transformation in 
Eq. 4. Table 1 summarizes our results for the different methods on the 417 
patients randomly selected for validation from the BraTS 2021 training phase. 
Figure 3 shows an example case where the sub-regions of the glioma are well 
predicted. Figure 4 shows an example case where the sub-regions of the glioma 
are hard to predict. 
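
The largest-connected-component step can be sketched with a breadth-first search over a 3D binary mask. The 6-connectivity used here is an assumption, since the text does not state the connectivity; in practice a library routine such as `scipy.ndimage.label` would typically be used for speed.

```python
from collections import deque

import numpy as np

def largest_component(mask):
    """Keep only the largest 6-connected component of a 3D binary mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask)
    best = []
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for start in zip(*np.nonzero(mask)):
        if seen[start]:
            continue
        seen[start] = True
        queue, comp = deque([start]), []
        while queue:
            z, y, x = queue.popleft()
            comp.append((z, y, x))
            for dz, dy, dx in steps:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= c < s for c, s in zip(n, mask.shape))
                if inside and mask[n] and not seen[n]:
                    seen[n] = True
                    queue.append(n)
        if len(comp) > len(best):
            best = comp
    out = np.zeros_like(mask)
    if best:
        out[tuple(np.array(best).T)] = True
    return out
```

This post-processing relies on the single-tumor observation above; for a patient with several separate lesions it would discard all but one.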


Table 1. DSC and sensitivity for 417 randomly selected patients from BraTS 2021. 


Metric         | Non-local Mask R-CNN | + Focal Loss | + Ensemble 
WT DSC         | 0.851                | 0.885        | 0.884 
TC DSC         | 0.817                | 0.829        | 0.833 
ET DSC         | 0.784                | 0.775        | 0.797 
WT sensitivity | 0.844                | 0.877        | 0.855 
TC sensitivity | 0.826                | 0.801        | 0.820 
ET sensitivity | 0.775                | 0.757        | 0.820 


Fig. 3. A patient whose sub-regions of Glioma (Red: WT; Green: ET; Yellow: 
NCT/NET) are well predicted on mpMRI using the model. The columns in order: 
T1Gd, T2, manual segmentation overlaid on T1Gd, model’s prediction overlaid on 
T1Gd. (Color figure online) 


4 Discussion and Conclusion 


In this paper, we experimented with a 2D non-local Mask R-CNN for the 
segmentation of the sub-regions of glioma, which include Whole Tumor, Tumor 
Core, and Enhancing Tumor. Using the different image types together (T1, T1Gd, 
T2 and FLAIR) as input to a deep neural network resulted in a 
promising solution for the task of brain tumor segmentation. In addition, 
we found that a region-based network for semantic segmentation produced 
promising results with plenty of room for improvement. We found that some 
false positives were generated when hyper-intense artifacts were observed on T2. 
One possible solution is to subtract T2 from FLAIR and input the difference as 
another channel. It is also worth experimenting with subtracting T1 from T1Gd. 
Another solution is to use different numbers of channels with respect 
to the variations of the signal intensity of voxels [18]. 


Fig. 4. A patient whose sub-regions of Glioma (Red: WT; Green: ET; Yellow: 
NCT/NET) are not delineated accurately by the model compared to human labels. 
The columns in order: T1Gd, T2, manual segmentation overlaid on T1Gd, model’s 
prediction overlaid on T1Gd. (Color figure online) 

There are some limitations to this work. We experimented with a 2D network 
to save computational cost, which led to discontinuities along the 
z-direction in the predicted results. A 3D non-local Mask R-CNN is still worth 
trying in the future, and a comparison with other semantic segmentation networks 
such as U-Net may be an interesting topic. 

The two classifiers of the non-local Mask R-CNN suffer from class imbalance 
among objects. The focal loss function used in this paper was mainly aimed 
at addressing this issue, and we found that the DSC of WT improved considerably 
(from 0.851 to 0.885). Also, by ensembling only the prediction on the up-down 
flipped image, the DSC and sensitivity of all glioma sub-regions (except for TC 
sensitivity) were improved. Ensembling predictions to filter out false 
predictions gives us an encouraging direction for future work. 
Abstract. With the development of technology, there are increasing cases 
of brain disease, and more treatments have been proposed and have achieved 
positive results. For brain lesions, however, early diagnosis can improve 
the chance of successful treatment and can help patients recuperate 
better. For this reason, brain lesion analysis is one of the most widely 
discussed topics in medical image analysis today. With improvements in 
network architectures, a variety of methods have been proposed that 
achieve competitive scores. In this paper, we propose a technique that 
uses EfficientNet for 3D images, in particular EfficientNet B0, for the brain 
lesion classification task, and achieves a competitive score. 
Moreover, we also propose a method that uses a Multiscale EfficientNet 
to classify the slices of the MRI data. 


Keywords: Brain-Lesion · EfficientNet · Medical image preprocessing 


1 Introduction 


In recent years, the number of cases of brain lesions has been increasing. 
According to the National Brain Tumor Society, about 700,000 people in the 
United States live with a brain tumour, and the figure had risen by the end of 
2020 [20]. Compared with other cancers such as breast cancer or lung cancer, a 
brain tumour is not more common, but it is the tenth leading cause of death 
worldwide [17]. According to United States statistics, an estimated 18,020 
adults will die this year from brain cancer. Moreover, a brain lesion can have a 
detrimental impact on the patient’s brain and can cause sequelae in the brain or 
in other organs. Nowadays, there are various methods to diagnose disease 
through medical images, such as CT scans, magnetic resonance imaging (MRI), 
and X-rays. 

A brain lesion is an abnormality in the brain seen on a brain-imaging 
test, such as magnetic resonance imaging (MRI) or computerized tomography (CT). 
Brain lesions appear as spots that differ from other tissues in the brain 
[18]. In this way, MRI can visualize abnormalities on slices of the 
brain [19]. 

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 

The goal of the 3D CT scan image classification task is to evaluate various 
methods for classifying brain lesions in medical images correctly and 
efficiently [21]. In parallel with the development of computer vision, 
particularly deep neural networks and the Vision Transformer, multiple methods 
have been proposed to classify abnormal tissue in organs from images such as 
CT scans and MRIs. In recent years, significant advances have been made in 
medical science through medical image processing techniques, which help doctors 
diagnose disease earlier and more easily. Previously, the process was tedious 
and time-consuming. To deal with this issue, it is necessary to apply 
computer-aided technology, because the medical field needs efficient and 
reliable techniques to diagnose life-threatening diseases like cancer, which is 
the leading cause of mortality globally for patients [5]. 

In this paper, we propose a method that uses a 3D EfficientNet to classify MRI 
images, together with a new approach that uses EfficientNet with Multiscale 
layers (MSL) to classify slices of MRI images. With the 3D EfficientNet, the 
model can achieve higher performance on feature extraction and classification. 
In contrast, MSL takes the features of an image slice and creates low-quality 
features to build a better feature map for the classification task. In this 
experiment, we use the EfficientNet B0 and EfficientNet B7 backbones to run the 
experiments and evaluate our method. 


2 Related Work 


2.1 Image Classification 


Image classification is a task that attempts to assign a specific 
label to an image. The input of the problem is the image, and the output is the 
label of this image. In recent years, the growth of computing resources has led 
to a variety of image classification methods, such as VGG-16, ResNet-50, and 
DenseNet. These architectures achieve competitive results on specific datasets. 
For image sequence datasets, various methods that combine a 
Convolutional Neural Network (CNN) with an RNN or LSTM have been 
proposed. More recently, Vision Transformer methods and state-of-the-art 
(SOTA) architectures combined with CNNs and 3D CNNs have been proposed. 
These architectures achieve competitive results on the tasks to which they are 
applied [2]. 


2.2 Transfer Learning 


Transfer learning is a method that applies a model previously trained on 
a large dataset, which we may not be able to access, to a new dataset. The merit 
of this method is that we can use a previous high-performance model for 
feature extraction on our dataset. This is why a model using 
transfer learning can achieve better accuracy when trained on a 
small dataset [25]. 


2.3 Brain-Tumour Classification 


Brain-tumour classification is one of the most popular tasks in medical image 
preprocessing [8]. The main goal of this task is to identify the brain lesion 
images in a set of images. In MRI images, a brain lesion appears 
as dark or light spots that differ from the surrounding tissue [23]. 

There are many methods, such as segmentation models that improve the data 
inputs or generative adversarial networks that increase the amount of data, to 
improve the performance of the training process [22]. Moreover, in recent years, 
many network architectures have been proposed to improve the classification 
score on this task [6]. 


3 Dataset 


The dataset for the experiment is from BraTS 2021; it targets 
the brain lesion classification task [14] of the RSNA-ASNR-MICCAI 
BraTS 2021 challenge [4]. The dataset consists of 585 training cases. Each 
case includes structural multi-parametric MRI (mpMRI) scans in DICOM 
format. The mpMRI scans are of four types: 


- Fluid Attenuated Inversion Recovery (FLAIR) 
- T1-weighted pre-contrast (T1w) 
- T1-weighted post-contrast (T1Gd) 
- T2-weighted (T2) 


This dataset is separated into two labels, 0 and 1, for the MGMT value, 
which is the diagnosis scale of brain-tumour detection [15] (Figs. 1 and 2). 


[Figure panel label: MGMT_value: 0] 


Fig. 2. Sample of MGMT value 1 


The MGMT promoter methylation status is defined as a binary 
label, with 0 for unmethylated and 1 for methylated [16]. In the challenge, this 
data is provided to the participants as a comma-separated value (.csv) file with 
the corresponding pseudo-identifiers of the mpMRI volumes [17] (study-level 
label). 


4 Method 


The method we propose in this paper is a classification method for brain 
MRI image data. The input is the brain MRI image data (in PNG, JPG or DICOM 
format). All images are preprocessed and augmented before 
being used to train the 3D EfficientNet model. The model can then be used to 
predict the MGMT value of new brain MRI image data. The 
diagram of our method is shown in Fig. 3: 


[Diagram: Brain MRI image data → Image preprocessing (normalize, resize) → Data 
augmentation → Classification (3D EfficientNet, SGD optimizer)] 


Fig. 3. Method diagram 


With the 2D dataset, we create the data from slices of the MRI images according 
to the four sequence types: FLAIR, T1w, T1Gd and T2. These four types yield four 
datasets of different sizes. By using a CNN on the 2D images, 
we can ensemble the four datasets, probing the ratios 3:3:3:2 and 2:4:2:2 for 
the results of the experiment. 


4.1 Data Preparation 


After loading the data, we resize all the images to (256, 256) and then split 
the dataset into a training set and a validation set in a ratio of 0.75:0.25. 
After resizing and splitting, we rescale the pixel values to the 
range [0, 1] by dividing by 255. For the MRI data, we apply this rescaling to 
each slice, so that the values of every slice lie in the range [0, 1]. 
Then we use the preprocessing provided with the EfficientNet application, so 
that the input is rescaled to the input range expected by the EfficientNet 
model. 
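
The split and rescaling steps can be sketched as below. The sketch assumes the images are already resized and stored as 8-bit values, shuffles before splitting (the text does not say whether the split is random), and leaves out the EfficientNet-specific preprocessing; the function name and seed are illustrative.

```python
import numpy as np

def split_and_rescale(images, labels, val_fraction=0.25, seed=0):
    """Split into training and validation sets (0.75:0.25) with pixels in [0, 1]."""
    images = np.asarray(images, dtype=np.float32) / 255.0   # 8-bit range to [0, 1]
    labels = np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(images))  # assumed shuffle
    n_val = int(round(len(images) * val_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (images[train_idx], labels[train_idx]), (images[val_idx], labels[val_idx])
```

Dividing by 255 commutes with the split, so applying it before or after splitting gives the same arrays.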


4.2 Data Augmentation 


Data augmentation is vital in the data preparation process. It 
increases the amount of data by adding slightly modified copies of 
existing data, or synthetic data newly created from existing data, to reduce the 
risk of overfitting. We use augmentation to generate 
data by randomly flipping images and applying random rotation with a factor of 
0.2. With the 2D slices, the augmentation is applied to each slice of the MRI 
data (Fig. 4). 


Fig. 4. The result after data augmentation process 


4.3 EfficientNet 3D 


EfficientNet 3D is an architecture based on the state-of-the-art 2D EfficientNet 
architecture. It is usually used for video classification or 3D 
classification tasks [24]. The architecture has five main parts: Initial 
Convolutional Layer 3D, Mobile Inverted Residual Bottleneck Block 3D, 
Convolutional Layer 3D, Global Average Pooling and Fully Connected Layer. It 
is a modified alternative to architectures that use ConvLSTM or traditional 
Conv3D layers, and it achieves competitive scores on 3D and video datasets 
[1]. In the experiment, we feed the MRI images, with size 256 × 256 × 4, to the 
input of the architecture. After passing through the initial Convolutional 
Layer 3D, the Mobile Inverted Residual Bottleneck Block 3D, and the 
other Convolutional Layer 3D for feature extraction, the Global Average 
Pooling layer creates the feature vector for the classification process (Fig. 5). 


[Figure: Initial Convolutional Layer 3D, Mobile Inverted Residual Bottleneck
Block 3D, Convolutional Layer 3D, Global Average Pooling, and Fully Connected
Layer, with outputs Drift / No Drift]


Fig. 5. EfficientNet3D B0 architecture


4.4 Multiscale EfficientNet


In the experiment, we found that a drawback of using a 3D CNN is the mismatch
of information across the channel dimension. We therefore take a new approach
that uses the slices of the MRI, namely the T1-weighted pre-contrast slices.
However, for the number of slices to be adequate for the training process to
perform well, we propose a Multiscale block that creates a high-quality feature
and a low-quality feature and ensembles them. This feature is then concatenated
with the output of the EfficientNet block to form the output of the architecture
(Fig. 6).


Fig. 6. Multiscale EfficientNet architecture


From the input layer with shape 256 x 256 x 3, the input follows two paths:
the Multiscale block and the EfficientNet block.

The Multiscale block contains two max pooling layers and two 2D convolution
layers, which create the low-quality feature and extract features from it.

This feature is an integral part of the ensemble and carries more information
from the first layers of the MRI slices. When it is combined with the
high-quality feature, the model achieves better feature extraction. In the
EfficientNet block, the high-quality feature is extracted as in a traditional
CNN. The output of this block is then concatenated with the feature from the
Multiscale block to create the output vector for the classification process.


4.5 Loss Function 


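To evaluate the performance of the model during training, we use the binary
cross-entropy loss:

$$ \mathcal{L}_{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \cdot \log(p(y_i)) + (1 - y_i) \cdot \log(1 - p(y_i)) \right] \tag{1} $$

Here N is the number of samples, y_i is the label of sample i, and p(y_i) is
the predicted probability for that sample. Equation (1) is the binary
cross-entropy formula, which suits our binary classification problem.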
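
As a small worked illustration of the binary cross-entropy loss, the sketch below evaluates it in plain Python. The function name and the clipping constant `eps` (used only to avoid log(0)) are our own choices and are not taken from the paper.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over N samples.

    y_true: iterable of 0/1 labels; y_prob: predicted probabilities.
    `eps` is an illustrative clipping constant that keeps log() finite.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)
```

A predictor that outputs 0.5 for every sample yields a loss of log 2 regardless of the labels, which is a useful baseline when reading training curves.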


4.6 Optimizer


To minimize the loss during training, we ran experiments with several
optimizers, including Stochastic Gradient Descent [12], Adam [9], and
Adadelta [11]. Based on these experiments, we chose the Adam optimizer because
of its merits, its performance at a learning rate of 0.0001, and the slight
decrease in validation loss that it produced.

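The update rule for each weight under the Adam optimizer is:

$$ w_t = w_{t-1} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} \tag{2} $$

Here \eta is the learning rate and \epsilon is a small constant. With the Adam
optimizer, each weight update is scaled by a running average of the squared
past gradients (\hat{v}_t), and a running average of the past gradients
(\hat{m}_t) is kept as momentum [9].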
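
The Adam update for a single scalar weight can be sketched as below. The decay rates `beta1` and `beta2` and the constant `eps` are the commonly used defaults and are assumptions here, since the text only states the learning rate of 0.0001.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar weight; returns the new (w, m, v).

    m and v are the running averages of the gradient and squared gradient,
    and t is the 1-based step index. beta1, beta2, and eps are assumed defaults.
    """
    m = beta1 * m + (1 - beta1) * grad          # momentum term
    v = beta2 * v + (1 - beta2) * grad * grad   # squared-gradient average
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so the weight moves by roughly one learning rate.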


5 Evaluation Metrics 


The experiment is evaluated with the area under the ROC curve (AUC), a standard
measure for binary classification. For a predictor f, an unbiased estimator of
its AUC can be expressed by the Wilcoxon-Mann-Whitney statistic [7]:

$$ AUC(f) = \frac{\sum_{t_0 \in D^0} \sum_{t_1 \in D^1} \mathbf{1}[f(t_0) < f(t_1)]}{|D^0| \cdot |D^1|} \tag{3} $$

where D^0 is the set of negative examples, D^1 is the set of positive examples,
and 1[.] is the indicator function. In this way, the AUC can also be calculated
as an average of a number of trapezoidal approximations, which helps make the
evaluation fair.
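
The Wilcoxon-Mann-Whitney estimator can be computed directly by counting correctly ordered negative/positive pairs. The sketch below is illustrative only; the function name is ours, and ties are counted as zero, following the strict inequality in the statistic.

```python
def auc_wmw(scores, labels):
    """AUC as the fraction of (negative, positive) pairs ranked correctly.

    scores: predictor outputs f(t); labels: 0 for negatives, 1 for positives.
    Ties contribute zero, matching the strict '<' in the statistic.
    """
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("AUC needs at least one negative and one positive")
    wins = sum(1 for t0 in neg for t1 in pos if t0 < t1)
    return wins / (len(neg) * len(pos))
```

For scores [0.1, 0.4, 0.35, 0.8] with labels [0, 0, 1, 1], three of the four negative/positive pairs are ordered correctly, giving an AUC of 0.75.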


6 Evaluation 


The following parameters are set for this experiment (Table 1):


Table 1. The parameter setup for model training 


Parameter     | Value
Optimizer     | Adam
Learning rate | 0.0001
Backbone      | EfficientNet B0
Loss          | Binary cross-entropy
Metrics       | AUC


In the competition, we obtain an AUC score of 0.60253 on the test dataset with
87 cases, which is competitive with the other methods on the same dataset.
Below is our experimental evaluation of different optimizers with
EfficientNet 3D (Table 2):


Table 2. The evaluation on each optimizer 


Optimizer | Evaluation (AUC)
Adam      | 0.60253
Adadelta  | 0.60124
SGD       | 0.60178
RMSprop   | 0.60223


These results were obtained by training for 100 epochs and keeping the weights
with the best validation AUC. We then use early stopping, which halts training
once the validation metric stops improving, to improve the AUC score of the
model.

To compare the two approaches, we benchmark both methods on the same test
dataset from the organizer (Table 3).


Table 3. Benchmarking for two methods 


Method                     | AUC
EfficientNet 3D            | 0.60253
Multiscale EfficientNet B7 | 0.67124


The benchmarking table shows that the Multiscale EfficientNet B7 outperforms
EfficientNet 3D in AUC. However, this method has a drawback in computing
resources: because it creates two types of features, low-quality and
high-quality, the computing time increases, which makes the method harder to
run when computing resources are limited.


7 Conclusion 


We demonstrated the use of EfficientNet 3D to classify MRI images. Our results
are competitive on the AUC evaluation metric. In our method, we use
EfficientNet 3D with the Adam optimizer and early stopping to improve the
performance of the model during training and achieve a competitive score.
Moreover, we apply data augmentation to reduce overfitting. However, some
issues remain to be addressed to improve the performance of the model, such as
data pre-processing and reducing the noise in the training dataset.

Furthermore, we can adopt a stronger EfficientNet 3D backbone, or use
Transformer or spatial attention modules to process each frame of the image
sequence. This new approach may yield better feature extraction and better
performance on the test dataset.


8 Future Work 


Although our method achieves a competitive score, it has some drawbacks. The
training time is long at 85 s/epoch; we can customize layers in the architecture
to reduce the computing cost. We can also add more layers or ensemble more
backbones to achieve better results.

Another approach is to classify each frame in the image sequence by applying
transfer learning with the previous backbone. This method may achieve a higher
score and reduce overfitting on the small training dataset.


Abstract. Tumor segmentation of brain MRI images is an important and
challenging computer vision task. With well-curated multi-institutional
multi-parametric MRI (mpMRI) data, the RSNA-ASNR-MICCAI Brain
Tumor Segmentation (BraTS) Challenge 2021 is a great benchmarking
venue for researchers worldwide to contribute to the advancement of the
state of the art. HarDNet is a memory-efficient neural network backbone
that has demonstrated excellent performance and efficiency in image
classification, object detection, real-time semantic segmentation, and
colonoscopy polyp segmentation. In this paper, we propose HarDNet-BTS,
a U-Net-like encoder-decoder architecture with a HarDNet backbone,
for brain tumor segmentation. We train it with the BraTS 2021
dataset using three training strategies and ensemble the resultant models
to improve the prediction quality. Assessment reports from the BraTS
2021 validation server show that HarDNet-BTS delivers state-of-the-art
performance (Dice_ET = 0.8442, Dice_TC = 0.8793, Dice_WT = 0.9260,
HD95_ET = 12.592, HD95_TC = 7.073, HD95_WT = 3.884). It was
ranked 8th in the validation phase. Its performance on the final testing
dataset is consistent with that of the validation phase (Dice_ET = 0.8727,
Dice_TC = 0.8665, Dice_WT = 0.9286, HD95_ET = 8.496, HD95_TC =
18.606, HD95_WT = 4.059). Inference on one MRI case takes only 16 s of
GPU time and 6 GB of GPU memory.


Keywords: Brain tumor segmentation - Medical imaging - Neural 
network - Deep learning 


1 Introduction 


A brain tumor is a mass of abnormal cells in the brain. There are many types
of tumors, cancerous (malignant) or noncancerous (benign). Treatment of
brain tumors usually involves surgical resection, radiation therapy, and
systemic drug therapy. When deciding which treatment to use, it is necessary
to accurately determine the location, extent, and volume of the tumor. This is
not easy, and it often requires an experienced neurosurgeon. Automatic
segmentation of the tumor region from Magnetic Resonance Imaging (MRI) scan
data is a practical approach.

Recent developments in deep learning have shown remarkable progress in many
computer vision tasks such as image classification, object detection or tracking,
and semantic or instance segmentation. The field of medical image segmentation
has also benefited greatly from this progress. For colonoscopy polyp
segmentation and brain tumor segmentation, U-Net [19] employed an
encoder-decoder architecture that achieved breakthrough performance and
inspired many improvements [6,13].

To make a deep learning approach practical, both network architecture design
and labeled dataset readiness are essential. Compared with the popular ImageNet
or COCO datasets, medical data is more difficult to obtain because labeling
takes many experienced physicians a long time, not to mention privacy, ethical,
and legal issues. Fortunately, the Brain Tumor Segmentation Challenge (BraTS)
[1-4,15] provides a platform with an expert-labeled dataset and standardized
assessment metrics for fair comparison. Over the past ten years, it has greatly
facilitated rapid progress in the field [11,12,21,22].

The BraTS 2021 dataset [1] consists of over 2,000 cases and is split into
1,251 for training, 219 for validation, and 570 for testing. Each case has four
MRI modalities: (a) native T1 (t1), (b) post-contrast T1-weighted (t1Gd),
(c) T2-weighted (t2), and (d) T2 Fluid Attenuated Inversion Recovery (t2-FLAIR).
Each case is stored as 3D NIfTI files (.nii.gz format) with an image size of
240 x 240 x 155, and the ground truth tumor regions are labeled as necrotic
tumor core (NCR - Label 1), peritumoral edematous tissue (ED - Label 2), and
GD-enhancing tumor (ET - Label 4). Training and validation data are available
to the participants, but only the training ground truth is given. Predictions
on the validation data are scored against the unseen ground truth on the
challenge organizers' servers. Test data is hidden from the participants. As in
previous BraTS challenges, participating models are assessed with the “Dice
Similarity Coefficient” and the “Hausdorff distance (95%)”.

For the 2021 BraTS Challenge, we propose HarDNet-BTS based on a U-Net- 
like encoder-decoder architecture and a memory-efficient backbone called HarD- 
Net. In this paper, we will describe the network design, our training strategies, 
and the experiment results evaluated by the official validation server. 


2 Method 


We first present the proposed neural network architecture. Then we describe how 
we pre-process and augment the training data. Finally, we report the selection 
of loss function and how to train and ensemble models. 


2.1 Proposed HarDNet-BTS Network Architecture 


Figure 1 depicts our proposed HarDNet-BTS neural network for brain tumor
segmentation. It is inspired by the encoder-decoder architecture popularized
by U-Net [19] and our previous experience with FC-HarDNet [5], which was
the state of the art in real-time semantic segmentation on the Cityscapes
dataset from July 2019 to January 2021 according to PapersWithCode. After two
stages of vanilla 3 x 3 x 3 convolution (colored gray), we replace all
convolution pipes with HarDNet blocks (colored blue or orange, elaborated
later). The first stage has 32 channels and the second has 64. At each
subsequent stage, we halve the resolution by down-sampling and double the
number of channels. Skip connections are employed to transport information from
the encoder side to the decoder side. The activation function is Mish [17]. All
down-sampling is done with Soft-Pooling [20], and all up-sampling is done with
tri-linear interpolation. Deep supervision uses a 1 x 1 x 1 convolution to
predict the background and class channels before up-sampling.


[Figure: encoder-decoder diagram annotated with feature-map channel counts and
resolutions; the legend includes convolution and Mish layers, concatenation,
and prediction outputs]


Fig. 1. HarDNet-BTS architecture overview.


Spatial information is essential to our segmentation task. After successive
down-sampling operations on the encoder side, the feature map resolution is
reduced to a very small size, which makes it difficult to generate an accurate
mask. Through three skip connections, we concatenate the feature maps from
corresponding encoder and decoder stages to enhance the model's spatial
information and thereby help integrate low-level and high-level information
to generate better masks.

A HarDNet convolution block, as illustrated in Fig. 2, is an improved version of
DenseNet [10]. Chao et al. [5] invented the HarDNet block based on their
observation of the off-chip memory traffic needed during inference. It
simplifies the shortcut patterns in the Denseblock. Figure 2 shows the two
versions (8-layer and 16-layer) of HarDNet blocks employed in our proposed
network. Unlike a Denseblock, which connects every stage to every other stage,
HarDNet's harmonic-wave-like connection pattern significantly reduces the
amount of off-chip DRAM access. It has been open-sourced and applied to many
computer vision tasks including image classification, object detection,
semantic segmentation, and medical segmentation. In particular, the fully
convolutional FC-HarDNet for real-time semantic segmentation and HarDNet-MSEG
[9] for colonoscopy polyp segmentation both achieved state-of-the-art (SOTA)
performance according to PapersWithCode. Thanks to HarDNet's efficient memory
usage and hence faster inference speed, we can afford more sophisticated
methods to achieve better results. For example, we can replace simple
activation functions such as ReLU [18] and Leaky ReLU with the more
sophisticated Mish [17], and AvgPooling and MaxPooling with SoftPooling [20].
Furthermore, we can use high-precision 32-bit floating-point numbers (FP32)
instead of half-precision FP16. All of these lead to higher accuracy.


Fig. 2. An 8-layer (blue) and a 16-layer (orange) HarDNet block. Shortcut connections
follow a harmonic wave pattern and channel numbers vary. (Color figure online)


2.2 Data Pre-processing 


Pre-processing and normalization of the input data help the model extract the
essential features. We apply the pre-processing suggested by Henry et al. [8].
First, we remove the dark boundaries of the four modalities of each MRI case
to cope with the problem of data imbalance and misleading predictions. Then we
normalize the data values by (1) calculating the distribution of non-zero
voxels in the images, (2) identifying the 1st and 99th percentiles as min and
max, respectively, and (3) min-max scaling the images. Finally, we employ
random cropping to obtain an image size of 128 x 128 x 128 or 144 x 144 x 128,
depending on the training options described later.
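
The percentile normalization and random cropping steps can be sketched with NumPy as follows. This is an illustration under our own assumptions rather than the authors' implementation: the function names are hypothetical, the normalization is applied to one modality at a time, and values outside the percentile range are clipped to [0, 1].

```python
import numpy as np


def normalize_volume(volume, low_pct=1.0, high_pct=99.0):
    """Min-max scale one modality using percentiles of its non-zero voxels.

    Clipping the result to [0, 1] is an assumption made for this sketch.
    """
    volume = np.asarray(volume, dtype=np.float32)
    nonzero = volume[volume != 0]
    if nonzero.size == 0:
        return volume.copy()
    lo, hi = np.percentile(nonzero, [low_pct, high_pct])
    if hi <= lo:
        return np.zeros_like(volume)
    return np.clip((volume - lo) / (hi - lo), 0.0, 1.0)


def random_crop(volume, size, rng):
    """Crop the last three (spatial) axes to `size` at a random offset."""
    spatial = volume.shape[-3:]
    starts = [int(rng.integers(0, dim - s + 1)) for dim, s in zip(spatial, size)]
    slices = tuple(slice(st, st + s) for st, s in zip(starts, size))
    return volume[(..., *slices)]
```

In practice the same crop offsets would also be applied to the ground-truth mask so that image and label stay aligned.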


2.3 Data Augmentation


Data augmentation can increase the diversity of data, reduce the probability of
over-fitting and enhance the robustness of a model. We use the following data
augmentations:


- scaling the voxel intensities by a factor between 0.9 and 1.1
- adding Gaussian noise to the images
- randomly dropping one of the four input channels
- flipping and transposing
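
The four augmentations listed above can be sketched with NumPy as below. The noise standard deviation, the channel-drop probability, the per-channel scaling factor, and the 50% flip/transpose probabilities are illustrative assumptions; the text does not give these values. Spatial operations are applied to the image and its mask together so that they stay aligned.

```python
import numpy as np


def augment(image, mask, rng, noise_std=0.05, drop_prob=0.2):
    """Randomly augment a (C, H, W, D) image and its (K, H, W, D) mask.

    noise_std, drop_prob, and the 0.5 flip/transpose probabilities are
    assumed values chosen for illustration only.
    """
    image = np.array(image, dtype=np.float32, copy=True)
    mask = np.array(mask, copy=True)

    # Intensity scaling by a random factor in [0.9, 1.1], one per channel.
    factors = rng.uniform(0.9, 1.1, size=(image.shape[0], 1, 1, 1))
    image *= factors.astype(np.float32)

    # Additive Gaussian noise.
    image += rng.normal(0.0, noise_std, size=image.shape).astype(np.float32)

    # Randomly drop (zero out) one input channel.
    if rng.random() < drop_prob:
        image[int(rng.integers(0, image.shape[0]))] = 0.0

    # Random flips along each spatial axis, applied to image and mask alike.
    for axis in (1, 2, 3):
        if rng.random() < 0.5:
            image = np.flip(image, axis=axis)
            mask = np.flip(mask, axis=axis)

    # Random transpose of the first two spatial axes (needs H == W).
    if rng.random() < 0.5 and image.shape[1] == image.shape[2]:
        image = np.swapaxes(image, 1, 2)
        mask = np.swapaxes(mask, 1, 2)

    return np.ascontiguousarray(image), np.ascontiguousarray(mask)
```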


2.4 Loss Function 


Commonly used loss functions for medical image segmentation networks include 
dice loss [7,16], cross entropy loss [11], and focal loss [14]. We employ dice loss 
(DL2, Eq. 1) and dice-and-cross-entropy loss (DLCEL, Eq. 4) defined below. 


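$$ DL2 = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{2 \, A_i \times B_i + \epsilon}{A_i^2 + B_i^2 + \epsilon} \tag{1} $$

$$ DL = 1 - \frac{1}{N} \sum_{i=1}^{N} \frac{2 \, A_i \times B_i + \epsilon}{A_i + B_i + \epsilon} \tag{2} $$

$$ CE = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{size} \sum_{j=1}^{size} \ell_{CE}(A_{ij}, B_{ij}) \tag{3} $$

where size is the output size and \ell_{CE} is the per-voxel cross-entropy
between prediction and ground truth.

$$ DLCEL = 0.8 \cdot DL + 0.2 \cdot CE \tag{4} $$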


In the equations above, A is the model prediction, B is the ground truth, and
ε is a smoothing factor. The two versions of Dice loss (DL2 of Eq. 1 and DL of
Eq. 2) differ in whether A and B in the denominator are squared. Equation 4
defines the compound loss DLCEL as a weighted combination of DL (Eq. 2) and
CE (Eq. 3). N = 2 or 4 is the number of output channels, that is, the
background plus the classes of the task. We want the loss function to take all
of these channels into account simultaneously. Therefore, we calculate the loss
of each channel separately and use their average as the final value.

Deep supervision produces an output for each layer of the decoder. We calculate
a loss value for each layer and optimize their average.
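
The channel-averaged losses can be sketched with NumPy as follows. Several points are assumptions on our part: the Dice term uses the standard form with a factor of 2 in the numerator, the cross-entropy term is taken to be the per-voxel binary cross-entropy, and the smoothing and clipping constants are arbitrary. Only the 0.8/0.2 weighting and the squared versus non-squared denominators come from the text.

```python
import numpy as np


def _flatten(x):
    """Reshape (N, ...) to (N, voxels) so each channel is handled separately."""
    x = np.asarray(x, dtype=np.float64)
    return x.reshape(x.shape[0], -1)


def dice_loss(pred, target, squared=False, eps=1.0):
    """Dice loss averaged over N channels; squared=True gives the DL2 variant."""
    a, b = _flatten(pred), _flatten(target)
    inter = (a * b).sum(axis=1)
    if squared:
        denom = (a ** 2).sum(axis=1) + (b ** 2).sum(axis=1)
    else:
        denom = a.sum(axis=1) + b.sum(axis=1)
    return float(1.0 - ((2.0 * inter + eps) / (denom + eps)).mean())


def cross_entropy(pred, target, clip=1e-7):
    """Binary cross-entropy averaged over voxels, then over channels (assumed form)."""
    a = np.clip(_flatten(pred), clip, 1.0 - clip)
    b = _flatten(target)
    per_voxel = -(b * np.log(a) + (1.0 - b) * np.log(1.0 - a))
    return float(per_voxel.mean(axis=1).mean())


def dlcel(pred, target):
    """Compound loss: 0.8 * DL + 0.2 * CE."""
    return 0.8 * dice_loss(pred, target) + 0.2 * cross_entropy(pred, target)
```

With deep supervision, the same function would be evaluated on each decoder output and the resulting values averaged.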


2.5 Training and Model Ensemble 


We train the proposed HarDNet-BTS using three strategies to obtain three
model versions, and then ensemble them, as follows:

- Version 1. Input size: 128 x 128 x 128, batch size: 6, and loss function:
  dice-and-cross-entropy loss (DLCEL, Eq. 4).

- Version 2. Input size: 144 x 144 x 128, batch size: 4, and loss function: dice
  loss (DL2, Eq. 1).

- Version 3. Same as Version 1 except that we train for ET, TC, and WT
  separately, and merge the results afterwards.

- Ensemble. Average the voxel confidence values of the outputs of all three
  versions.


2.6 Inference 


To segment a test case, we first apply the same data pre-processing as in
the training phase. Then, we use Test Time Augmentation (TTA) to generate
16 different inputs via data flipping. Each of the three trained model versions
predicts on these 16 inputs, producing 48 results in total. Finally, the
average of all results is the prediction.
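
Flip-based TTA with ensemble averaging can be sketched as below. The sketch enumerates only the 8 combinations of flips along three spatial axes; how the 16 variants reported above are generated is not specified in the text, so this is a simplified assumption. Each `model` is a hypothetical callable that maps a (C, H, W, D) array to a prediction with the same spatial layout.

```python
import itertools

import numpy as np


def flip_tta_predict(models, volume, spatial_axes=(1, 2, 3)):
    """Average predictions over all flip combinations and all models.

    Each prediction is flipped back before averaging so that voxels align.
    """
    preds = []
    for model in models:
        for r in range(len(spatial_axes) + 1):
            for axes in itertools.combinations(spatial_axes, r):
                flipped = np.flip(volume, axis=axes) if axes else volume
                out = model(flipped)
                preds.append(np.flip(out, axis=axes) if axes else out)
    return np.mean(preds, axis=0)
```

Flipping the prediction back with the same axes is what makes the average meaningful; without it, voxels from mirrored positions would be mixed.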

Our models output ET, TC, and WT, but the ground truth labels are
ET, NCR, and ED. We therefore reconstruct NCR by removing ET from TC, and ED
by removing TC from WT.
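
The label reconstruction can be written as a few mask operations. The sketch assumes the three region outputs have already been thresholded into binary masks; the function name is ours, while the label values 1, 2, and 4 follow the BraTS convention stated in the dataset description.

```python
import numpy as np


def regions_to_labels(et, tc, wt):
    """Convert binary ET/TC/WT region masks into a BraTS label map.

    NCR (label 1) = TC minus ET, ED (label 2) = WT minus TC, ET = label 4.
    """
    et, tc, wt = (np.asarray(x, dtype=bool) for x in (et, tc, wt))
    labels = np.zeros(et.shape, dtype=np.uint8)
    labels[wt & ~tc] = 2  # edema: whole tumor outside the tumor core
    labels[tc & ~et] = 1  # necrotic core: tumor core outside enhancing tumor
    labels[et] = 4        # enhancing tumor
    return labels
```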

In terms of speed, if only Version 1 is used without TTA, an image
takes about 0.25 s of GPU time and 6 GB of GPU memory. For higher accuracy
requirements, with model ensembling and TTA, an image takes about 16 s.


3 Results 


We have implemented the proposed HarDNet-BTS neural network in PyTorch
1.9.0 and trained it using two GPUs (NVIDIA Tesla V100 32GB). We employ
the Ranger optimizer, set the initial learning rate to 1e-4, train the network
for 1,400 epochs, and fine-tune it for 150 epochs. Figure 3 shows some sample
cases of training data, ground truth, and predictions by HarDNet-BTS.

We have entered HarDNet-BTS in the BraTS 2021 challenge. The official
evaluation metrics are the Dice coefficient (Dice_ET, Dice_TC, and Dice_WT) and
the Hausdorff distance 95% (HD95_ET, HD95_TC, and HD95_WT). For the former,
higher is better; for the latter, lower is better. Tables 1 and 2 give the
segmentation scores for the training set (1251 cases) and validation set (219
cases), respectively. In the tables, we list three values (enhancing tumor (ET),
tumor core (TC), and whole tumor (WT)) of both the Dice coefficient (Dice) and
the Hausdorff distance 95% (HD95) for each of the three versions of trained
models as well as their ensemble. Table 3 further shows detailed statistics
from the evaluation report of the validation set predicted by the ensemble
model. Figure 4 gives the box plots of the same information.


[Figure panels: Input, Ground Truth, Prediction]


Fig. 3. Sample visual results of the proposed HarDNet-BTS. The labels include
NCR (red), ED (green) and ET (blue). (Color figure online)


Table 4 shows the detailed evaluation report of the test data set provided by
the challenge organizer. Figure 5 compares the box plots of the Dice
coefficients of both the validation and testing datasets. The scores on the two
datasets are consistent, which indicates that our model is robust.


Table 1. The segmentation results of the BraTS 2021 training dataset (1251 cases).


Models   | Dice ET | Dice TC | Dice WT | HD95 ET | HD95 TC | HD95 WT
Ver. 1   | 0.9106  | 0.9511  | 0.9520  | 7.321   | 4.908   | 4.053
Ver. 2   | 0.9123  | 0.9557  | 0.9523  | 6.074   | 3.522   | 3.177
Ver. 3   | 0.9075  | 0.9523  | 0.9596  | 5.098   | 6.508   | 3.682
Ensemble | 0.9164  | 0.9565  | 0.9593  | 6.385   | 5.132   | 3.064


4 Discussion 


We have presented HarDNet-BTS, an encoder-decoder neural network with
a memory-efficient CNN backbone, for brain tumor segmentation. We have
described the reasons behind the network architecture design, loss function
selection, data augmentation, and model ensemble strategies. We have
participated in the RSNA-ASNR-MICCAI Brain Tumor Segmentation (BraTS) Challenge
2021. Our validation results rank 8th among all participants, and the testing
results show consistent quality.

Due to GPU resource limitations, we could not experiment with many data
augmentation and training techniques. In the future, we would like to
investigate these possibilities further. The box plots also show some outliers
that need further investigation.


Table 2. The segmentation results of the BraTS 2021 validation dataset (219 cases). 


Models   | Dice ET | Dice TC | Dice WT | HD95 ET | HD95 TC | HD95 WT
Ver. 1   | 0.8375  | 0.8727  | 0.9220  | 15.893  | 8.862   | 4.065
Ver. 2   | 0.8374  | 0.8759  | 0.9229  | 12.555  | 8.630   | 4.108
Ver. 3   | 0.8386  | 0.8816  | 0.9258  | 12.655  | 7.836   | 3.764
Ensemble | 0.8442  | 0.8793  | 0.9260  | 12.592  | 7.073   | 3.884


Table 3. Statistics of the prediction of the validation dataset. 


Statistic    | Dice ET | Dice TC | Dice WT | HD95 ET | HD95 TC | HD95 WT
Mean         | 0.8442  | 0.8793  | 0.9260  | 12.592  | 7.073   | 3.884
StdDev       | 0.2084  | 0.1820  | 0.0755  | 60.651  | 35.492  | 7.428
Median       | 0.9012  | 0.9415  | 0.9467  | 1.414   | 1.732   | 2.236
25% quantile | 0.8458  | 0.8848  | 0.9086  | 1.000   | 1.000   | 1.414
75% quantile | 0.9534  | 0.9660  | 0.9680  | 2.236   | 4.000   | 3.673


Table 4. Statistics of the prediction of the testing dataset.


Statistic    | Dice ET | Dice TC | Dice WT | HD95 ET | HD95 TC | HD95 WT
Mean         | 0.8727  | 0.8665  | 0.9286  | 8.496   | 18.606  | 4.059
StdDev       | 0.1727  | 0.2457  | 0.0885  | 46.714  | 70.541  | 8.007
Median       | 0.9254  | 0.9551  | 0.9566  | 1.000   | 1.414   | 1.732
25% quantile | 0.9051  | 0.9165  | 0.9086  | 1.000   | 1.000   | 1.000
75% quantile | 0.9600  | 0.9774  | 0.9756  | 2.000   | 3.121   | 3.741


Fig. 4. Box plots of Dice and HD95 of the validation dataset (219 cases).


[Figure: Dice coefficient box plots for ET, TC, and WT; validation dataset in
blue, testing dataset in orange]


Fig. 5. Box plots of Dice coefficients of the validation dataset and testing dataset.


Abstract. Semantic segmentation of brain tumors is a fundamental
medical image analysis task involving multiple MRI imaging modalities
that can assist clinicians in diagnosing the patient and successively
studying the progression of the malignant entity. In recent years, Fully
Convolutional Neural Network (FCNN) approaches have become the
de facto standard for 3D medical image segmentation. The popular “U- 
shaped” network architecture has achieved state-of-the-art performance 
benchmarks on different 2D and 3D semantic segmentation tasks and 
across various imaging modalities. However, due to the limited kernel 
size of convolution layers in FCNNs, their performance of modeling 
long-range information is sub-optimal, and this can lead to deficiencies 
in the segmentation of tumors with variable sizes. On the other hand, 
transformer models have demonstrated excellent capabilities in capturing 
such long-range information in multiple domains, including natural lan- 
guage processing and computer vision. Inspired by the success of vision 
transformers and their variants, we propose a novel segmentation model 
termed Swin UNEt TRansformers (Swin UNETR). Specifically, the task 
of 3D brain tumor semantic segmentation is reformulated as a
sequence-to-sequence prediction problem wherein multi-modal input data is
projected into a 1D sequence of embeddings and used as an input to a
hierarchical Swin transformer as the encoder. The Swin transformer encoder
extracts features at five different resolutions by utilizing shifted windows 
for computing self-attention and is connected to an FCNN-based decoder 
at each resolution via skip connections. We have participated in BraTS 
2021 segmentation challenge, and our proposed model ranks among the 
top-performing approaches in the validation phase. 
Code: https://monai.io/research/swin-unetr. 


Keywords: Image segmentation - Vision transformer - Swin 
transformer - UNETR - Swin UNETR - BRATS - Brain tumor 
segmentation 


1 Introduction


There are over 120 types of brain tumors that affect the human brain [27]. As we
enter the era of Artificial Intelligence (AI) for healthcare, AI-based
intervention for diagnosis and surgical pre-assessment of tumors is on the verge
of becoming a necessity rather than a luxury. Elaborate characterization of
brain tumors with techniques such as volumetric analysis is useful to study
their progression and assist in pre-surgical planning [17]. In addition to
surgical applications, characterization of delineated tumors can be directly
utilized for the prediction of life expectancy [32]. Brain tumor segmentation
is at the forefront of all such applications.

Brain tumors are categorized into primary and secondary tumor types. Pri- 
mary brain tumors originate from brain cells, while secondary tumors metastasize 
into the brain from other organs. The most common primary brain tumors are 
gliomas, which arise from brain glial cells and are categorized into low-grade
(LGG) and high-grade (HGG) subtypes. High-grade gliomas are an aggressive
type of malignant brain tumor that grows rapidly, typically requires surgery
and radiotherapy, and has a poor survival prognosis [40]. As a reliable
diagnostic tool, Magnetic Resonance Imaging (MRI) plays a vital role in
monitoring and surgery planning for brain tumor analysis. Typically, several
complementary
3D MRI modalities, such as T1, T1 with contrast agent (T1c), T2 and Fluid- 
attenuated Inversion Recovery (FLAIR), are required to emphasize different tis- 
sue properties and areas of tumor spread. For instance, gadolinium as the contrast 
agent emphasizes hyperactive tumor sub-regions in the T1c MRI modality [15]. 

Furthermore, automated medical image segmentation techniques [18] have 
shown prominence for providing an accurate and reproducible solution for brain 
tumor delineation. Recently, deep learning-based brain tumor segmentation
techniques [19,20,30,31] have achieved state-of-the-art performance in various
benchmarks [2,7,34]. These advances are mainly due to the powerful feature
extraction capabilities of Convolutional Neural Networks (CNNs). However, the
limited kernel size of CNN-based techniques restricts their capability of
learning long-range
dependencies that are critical for accurate segmentation of tumors that appear 
in various shapes and sizes. Although several efforts [10,23] have tried to address 
this limitation by increasing the receptive field of the convolutional kernels, the 
effective receptive field is still limited to local regions. 

Recently, transformer-based models have shown prominence in various
domains such as natural language processing and computer vision [13,14,37]. In
computer vision, Vision Transformers (ViTs) [14] have demonstrated
state-of-the-art performance on various benchmarks. Specifically, the
self-attention module in ViT-based models allows for modeling long-range
information by pairwise interaction between token embeddings, leading to more
effective local and global contextual representations [33]. In addition, ViTs
have achieved success in effective learning of pretext tasks for self-supervised
pre-training in various applications [8,9,35]. In medical image analysis,
UNETR [16] is the first methodology that utilizes a ViT as its encoder without
relying on a CNN-based feature extractor. Other approaches [38,39] have
attempted to leverage the power of ViTs as a stand-alone block in their
architectures, which otherwise consist of CNN-based components. However, UNETR
has shown better performance in terms of both accuracy and efficiency in
different medical image segmentation tasks [16].

Recently, Swin transformers [24,25] have been proposed as a hierarchical 
vision transformer that computes self-attention in an efficient shifted window 
partitioning scheme. As a result, Swin transformers are suitable for various down- 
stream tasks wherein the extracted multi-scale features can be leveraged for fur- 
ther processing. In this work, we propose a novel architecture termed Swin UNEt 
TRansformers (Swin UNETR), which utilizes a U-shaped network with a Swin 
transformer as the encoder and connects it to a CNN-based decoder at different 
resolutions via skip connections. We validate the effectiveness of our approach 
for the task of multi-modal 3D brain tumor segmentation in the 2021 edition of 
the Multi-modal Brain Tumor Segmentation Challenge (BraTS). Our model is 
one of the top-ranking methods in the validation phase and has demonstrated 
competitive performance in the testing phase. 


2 Related Work 


In the previous BraTS challenges, ensembles of U-Net shaped architectures have 
achieved promising results for multi-modal brain tumor segmentation. Kamnit- 
sas et al. [21] proposed a robust segmentation model by aggregating the outputs 
of various CNN-based models such as 3D U-Net [12], 3D FCN [26] and
DeepMedic [22]. Subsequently, Myronenko et al. [30] introduced SegResNet, which
utilizes a residual encoder-decoder architecture in which an auxiliary branch is 
used to reconstruct the input data with a variational auto-encoder as a surrogate 
task. Zhou et al. [42] proposed to use an ensemble of different CNN-based net- 
works by taking into account the multi-scale contextual information through an 
attention block. Zhou et al. [20] used a two-stage cascaded approach consisting 
of U-Net models wherein the first stage computes a coarse segmentation predic- 
tion which will be refined by the second stage. Furthermore, Isensee et al. [19] 
proposed the nnU-Net model and demonstrated that a generic U-Net architec- 
ture with minor modifications is enough to achieve competitive performance in 
multiple BraTS challenges. 

Transformer-based models have recently gained a lot of traction in computer
vision [14,24,41] and medical image analysis [11,16]. Chen et al. [11]
introduced a 2D U-Net architecture that benefits from a ViT in the bottleneck of the
network. Wang et al. [38] extended this approach for 3D brain tumor segmen- 
tation. In addition, Xie et al. [39] proposed to use a ViT-based model with 
deformable transformer layers between its CNN-based encoder and decoder by 
processing the extracted features at different resolutions. Different from these 
approaches, Hatamizadeh et al. [16] proposed the UNETR architecture in which 
a ViT-based encoder, which directly utilizes 3D input patches, is connected to a 
CNN-based decoder. UNETR has shown promising results for brain tumor segmentation
using the MSD dataset [1]. Unlike the UNETR model, our proposed
Swin UNETR architecture uses a Swin transformer encoder which extracts feature
representations at several resolutions with a shifted windowing mechanism
for computing the self-attention. We demonstrate that Swin transformers [24]
have a great capability of learning multi-scale contextual representations and
modeling long-range dependencies in comparison to ViT-based approaches with
fixed resolution.

Fig. 1. Overview of the Swin UNETR architecture. The input to our model is 3D
multi-modal MRI images with 4 channels. The Swin UNETR creates non-overlapping
patches of the input data and uses a patch partition layer to create windows with
a desired size for computing the self-attention. The encoded feature representations
in the Swin transformer are fed to a CNN-decoder via skip connections at multiple
resolutions. The final segmentation output consists of 3 output channels corresponding to
the ET, WT and TC sub-regions.


3 Swin UNETR 


3.1 Encoder 


We illustrate the architecture of Swin UNETR in Fig. 1. The input to the
Swin UNETR model $\mathcal{X} \in \mathbb{R}^{H \times W \times D \times S}$ is a token with a patch resolution of
$(H', W', D')$ and dimension of $H' \times W' \times D' \times S$. We first utilize a patch partition
layer to create a sequence of 3D tokens with dimension of
$\lceil H/H' \rceil \times \lceil W/W' \rceil \times \lceil D/D' \rceil$
and project them into an embedding space with dimension $C$. The self-attention
is computed within non-overlapping windows that are created in the partitioning
stage for efficient token interaction modeling. Figure 2 shows the shifted windowing
mechanism for subsequent layers. Specifically, we utilize windows of size
$M \times M \times M$ to evenly partition a 3D token into
$\lceil H'/M \rceil \times \lceil W'/M \rceil \times \lceil D'/M \rceil$ regions at
a given layer $l$ in the transformer encoder. Subsequently, in layer $l+1$, the
partitioned window regions are shifted by
$(\lfloor M/2 \rfloor, \lfloor M/2 \rfloor, \lfloor M/2 \rfloor)$ voxels. In the subsequent
layers $l$ and $l+1$ of the encoder, the outputs are calculated as

$$
\begin{aligned}
\hat{z}^{l} &= \text{W-MSA}(\text{LN}(z^{l-1})) + z^{l-1} \\
z^{l} &= \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l} \\
\hat{z}^{l+1} &= \text{SW-MSA}(\text{LN}(z^{l})) + z^{l} \\
z^{l+1} &= \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}.
\end{aligned} \tag{1}
$$

Here, W-MSA and SW-MSA are the regular and shifted window partitioning multi-head
self-attention modules, respectively; $\hat{z}^{l}$ and $\hat{z}^{l+1}$ denote the outputs of W-MSA
and SW-MSA; LN and MLP denote layer normalization and multi-layer perceptron,
respectively. For efficient computation of the shifted window mechanism,
we leverage a 3D cyclic-shifting [24] and compute self-attention according to

$$
\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right) V, \tag{2}
$$

in which $Q$, $K$, $V$ denote queries, keys, and values respectively; $d$ represents the
size of the query and key.
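
The window partitioning, the cyclic shift and the attention of Eq. (2) can be illustrated with a small NumPy sketch. It is not the authors' implementation: it uses a single attention head, random projection matrices (`wq`, `wk`, `wv` are hypothetical names), and it omits the attention mask and relative position bias that Swin transformers [24] apply to shifted windows.

```python
import numpy as np

def window_partition(tokens, window):
    """Split a (H, W, D, C) token grid into non-overlapping window^3 blocks.

    Returns an array of shape (num_windows, window**3, C).
    """
    h, w, d, c = tokens.shape
    assert h % window == 0 and w % window == 0 and d % window == 0
    x = tokens.reshape(h // window, window, w // window, window, d // window, window, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, window ** 3, c)

def cyclic_shift(tokens, window):
    """Roll the grid by floor(window / 2) voxels along each spatial axis."""
    s = window // 2
    return np.roll(tokens, shift=(-s, -s, -s), axis=(0, 1, 2))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(windows, wq, wk, wv):
    """Single-head scaled dot-product attention inside each window (Eq. 2)."""
    q, k, v = windows @ wq, windows @ wk, windows @ wv
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 8, 8, 48))  # 8 x 8 x 8 tokens with C = 48, as in Fig. 2
wq, wk, wv = (rng.normal(size=(48, 16)) for _ in range(3))

regular = window_partition(tokens, 4)                   # layer l (W-MSA)
shifted = window_partition(cyclic_shift(tokens, 4), 4)  # layer l + 1 (SW-MSA)
out = window_attention(regular, wq, wk, wv)
```

With 8 x 8 x 8 tokens and a window size of 4, both layers see eight windows of 64 tokens each; the shift only changes which tokens share a window.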

The Swin UNETR encoder has a patch size of 2 × 2 × 2 and a feature dimension
of 2 × 2 × 2 × 4 = 32, taking into account the multi-modal MRI images with 4
channels. The size of the embedding space C is set to 48 in our encoder. Furthermore,
the Swin UNETR encoder has 4 stages, each comprising 2 transformer
blocks. Hence, the total number of layers in the encoder is L = 8.
In stage 1, a linear embedding layer is utilized to create
$\frac{H}{2} \times \frac{W}{2} \times \frac{D}{2}$ 3D tokens.
To maintain the hierarchical structure of the encoder, a patch merging layer is
utilized to decrease the resolution of feature representations by a factor of 2 at
the end of each stage. In addition, a patch merging layer groups patches with
resolution 2 × 2 × 2 and concatenates them, resulting in a 4C-dimensional feature
embedding. The feature size of the representations is subsequently reduced to
2C with a linear layer. Stage 2, stage 3 and stage 4, with resolutions of
$\frac{H}{4} \times \frac{W}{4} \times \frac{D}{4}$,
$\frac{H}{8} \times \frac{W}{8} \times \frac{D}{8}$ and
$\frac{H}{16} \times \frac{W}{16} \times \frac{D}{16}$ respectively, follow the same network design.
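
The bookkeeping of the encoder can be checked with a short sketch. The helper names `patch_partition` and `stage_shapes` are hypothetical, and the sketch assumes that the channel count doubles from one stage to the next, in line with the reduction to 2C described above.

```python
import numpy as np

def patch_partition(volume, patch=2):
    """Group a (H, W, D, S) volume into non-overlapping patch^3 tokens.

    Returns an array of shape (H/patch, W/patch, D/patch, patch**3 * S).
    """
    h, w, d, s = volume.shape
    x = volume.reshape(h // patch, patch, w // patch, patch, d // patch, patch, s)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(h // patch, w // patch, d // patch, patch ** 3 * s)

def stage_shapes(size, embed_dim=48, num_stages=4):
    """Token grid and channel count for stages 1..num_stages.

    Assumes the resolution halves and the channel count doubles per stage.
    """
    return [(tuple(n // 2 ** i for n in size), embed_dim * 2 ** (i - 1))
            for i in range(1, num_stages + 1)]

volume = np.zeros((32, 32, 32, 4), dtype=np.float32)  # small stand-in for a 4-channel crop
tokens = patch_partition(volume)          # each token holds 2 * 2 * 2 * 4 = 32 values
shapes = stage_shapes((128, 128, 128))    # training crops are 128 x 128 x 128
```

For a 128 × 128 × 128 crop this yields token grids of 64³, 32³, 16³ and 8³ with 48, 96, 192 and 384 channels under the stated assumption.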


3.2 Decoder 


Swin UNETR has a U-shaped network design in which the extracted feature rep- 
resentations of the encoder are used in the decoder via skip connections at each 
resolution. At each stage $i$ ($i \in \{0, 1, 2, 3, 4\}$) in the encoder and the bottleneck
($i = 5$), the output feature representations are reshaped into size
$\frac{H}{2^i} \times \frac{W}{2^i} \times \frac{D}{2^i}$
and fed into a residual block comprising two 3 × 3 × 3 convolutional layers that
are normalized by instance normalization [36] layers. Subsequently, the resolution
of the feature maps is increased by a factor of 2 using a deconvolutional
layer and the outputs are concatenated with the outputs of the previous stage.
The concatenated features are then fed into another residual block as previously
described. The final segmentation outputs are computed by using a 1 × 1 × 1
convolutional layer and a sigmoid activation function.

Fig. 2. Overview of the shifted windowing mechanism. Note that 8 × 8 × 8 3D tokens
and a 4 × 4 × 4 window size are illustrated.


Table 1. Swin UNETR configurations.

Embed dimension | Feature size | Number of blocks | Window size | Number of heads | Parameters | FLOPs
768             | 48           | [2, 2, 2, 2]     | [7, 7, 7]   | [3, 6, 12, 24]  | 61.98M     | 394.84G


3.3 Loss Function 


We use the soft Dice loss function [29] which is computed in a voxel-wise manner as

$$
\mathcal{L}(G, Y) = 1 - \frac{2}{J} \sum_{j=1}^{J} \frac{\sum_{i=1}^{I} G_{i,j} Y_{i,j}}{\sum_{i=1}^{I} G_{i,j}^{2} + \sum_{i=1}^{I} Y_{i,j}^{2}}, \tag{3}
$$

where $I$ denotes the number of voxels; $J$ is the number of classes; $Y_{i,j}$ and $G_{i,j}$ denote the
probability of output and one-hot encoded ground truth for class $j$ at voxel $i$,
respectively.
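
As a minimal sketch of Eq. (3), the loss can be written in a few lines of NumPy. The function name `soft_dice_loss` and the small constant `eps` are assumptions added here to avoid a division by zero for classes that are absent from both arrays; they are not part of the formulation above.

```python
import numpy as np

def soft_dice_loss(g, y, eps=1e-8):
    """Soft Dice loss of Eq. (3).

    g: one-hot ground truth, shape (I, J) for I voxels and J classes.
    y: predicted probabilities, same shape.
    eps: small constant (an assumption) that avoids 0/0 for empty classes.
    """
    g = np.asarray(g, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    intersection = (g * y).sum(axis=0)                       # sum over voxels i
    denominator = (g ** 2).sum(axis=0) + (y ** 2).sum(axis=0) + eps
    return 1.0 - (2.0 / g.shape[1]) * (intersection / denominator).sum()

g = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # 4 voxels, 2 classes
perfect = soft_dice_loss(g, g)       # prediction equals the ground truth
inverted = soft_dice_loss(g, 1 - g)  # prediction has no overlap with it
```

A perfect prediction drives the loss to (approximately) zero, while a prediction without any overlap gives a loss of one.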


3.4 Implementation Details 


Swin UNETR is implemented using PyTorch¹ and MONAI² and trained on a
DGX-1 cluster with 8 NVIDIA V100 GPUs. Table 1 details the configurations
of the Swin UNETR architecture, number of parameters and FLOPs. The learning
rate is set to 0.0008. We normalize all input images to have zero mean and
unit standard deviation according to non-zero voxels. Random patches of 128 ×
128 × 128 were cropped from 3D image volumes during training. We apply a
random axis mirror flip with a probability of 0.5 for all 3 axes. Additionally, we
apply data augmentation transforms of random per-channel intensity shift in the
range (−0.1, 0.1), and random scale of intensity in the range (0.9, 1.1) to input
image channels. The batch size per GPU was set to 1. All models were trained
for a total of 800 epochs with a linear warmup and using a cosine annealing
learning rate scheduler. For inference, we use a sliding window approach with
an overlap of 0.7 for neighboring voxels.

¹ http://pytorch.org/.
² https://monai.io/.

Fig. 3. A typical segmentation example of the predicted labels which are overlaid on
T1, T1c, T2 and FLAIR MRI axial slices in each row. The first two rows depict ~75th
percentile performance based on the Dice score. Rows 3 and 4 depict ~50th percentile
performance while the last two rows are at ~25th percentile performance. The image
intensities are on a gray color scale. The blue, red and green colors correspond to TC,
ET and WT sub-regions respectively. Note that all samples have been selected from
the BraTS 2021 validation set. (Color figure online)
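
The preprocessing, augmentation and learning rate schedule of Sect. 3.4 can be sketched in plain NumPy as follows. This is an illustration rather than the MONAI pipeline used for the experiments: all function names are hypothetical, the order of the intensity shift and scaling is an assumption, and the warmup length (`warmup_epochs=50`) is not given in the text.

```python
import math
import numpy as np

def normalize_nonzero(image):
    """Zero-mean, unit-std normalization per channel over non-zero voxels.

    image: array of shape (C, H, W, D). Zero (background) voxels stay zero.
    """
    out = image.astype(np.float32).copy()
    for c in range(out.shape[0]):
        mask = out[c] != 0
        if mask.any():
            values = out[c][mask]
            out[c][mask] = (values - values.mean()) / (values.std() + 1e-8)
    return out

def random_crop(image, size, rng):
    """Crop a random size^3 patch from a (C, H, W, D) volume."""
    h, w, d = (rng.integers(0, dim - size + 1) for dim in image.shape[1:])
    return image[:, h:h + size, w:w + size, d:d + size]

def augment(image, rng):
    """Random mirror flips, per-channel intensity shift and intensity scaling."""
    for axis in (1, 2, 3):
        if rng.random() < 0.5:
            image = np.flip(image, axis=axis)
    shift = rng.uniform(-0.1, 0.1, size=(image.shape[0], 1, 1, 1))
    scale = rng.uniform(0.9, 1.1)
    return (image + shift) * scale

def learning_rate(epoch, base_lr=0.0008, total_epochs=800, warmup_epochs=50):
    """Linear warmup followed by cosine annealing; warmup_epochs is assumed."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

rng = np.random.default_rng(0)
volume = rng.uniform(0, 100, size=(4, 20, 20, 16))  # toy 4-channel volume
volume[:, :2] = 0                                   # pretend background
patch = augment(random_crop(normalize_nonzero(volume), 8, rng), rng)
```

The toy crop size of 8 stands in for the 128 × 128 × 128 patches used during training.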


3.5 Dataset and Model Ensembling 


The BraTS challenge aims to evaluate state-of-the-art methods for the seman- 
tic segmentation of brain tumors by providing a 3D MRI dataset with voxel- 
wise ground truth labels that are annotated by physicians [3-6,28]. The BraTS 
2021 challenge training dataset includes 1251 subjects, each with four 3D MRI
modalities: a) native T1-weighted (T1), b) post-contrast T1-weighted (T1Gd),
c) T2-weighted (T2), and d) T2 Fluid-attenuated Inversion Recovery (T2-FLAIR),
which are rigidly aligned, resampled to a 1 × 1 × 1 mm isotropic resolution
and skull-stripped. The input image size is 240 × 240 × 155. The data were 
collected from multiple institutions using various MRI scanners. Annotations 
include three tumor sub-regions: the enhancing tumor, the peritumoral edema, 
and the necrotic and non-enhancing tumor core. The annotations were com- 
bined into three nested sub-regions: Whole Tumor (WT), Tumor Core (TC), 
and Enhancing Tumor (ET). Figure 3 illustrates typical segmentation outputs 
of all semantic classes. During this challenge, two additional datasets without 
the ground truth labels were provided for validation and testing phases. These 
datasets required participants to upload the segmentation masks to the organiz- 
ers’ server for evaluations. The validation dataset, which is designed for interme- 
diate model evaluations, consists of 219 cases. Additional information regarding 
the testing dataset was not provided to participants. 
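
The combination of the annotated labels into nested sub-regions can be sketched as below. The integer label values (1, 2, 4) and the composition TC = ET ∪ necrotic/non-enhancing core, WT = TC ∪ edema follow the usual BraTS convention and are assumptions here, since the text does not state them explicitly.

```python
import numpy as np

# Assumed label convention: 1 = necrotic and non-enhancing tumor core,
# 2 = peritumoral edema, 4 = enhancing tumor.
NCR, ED, ET = 1, 2, 4

def to_nested_regions(label_map):
    """Convert an integer label map into stacked binary ET, WT and TC masks."""
    et = label_map == ET
    tc = et | (label_map == NCR)
    wt = tc | (label_map == ED)
    return np.stack([et, wt, tc]).astype(np.uint8)

labels = np.array([[0, 1, 2],
                   [4, 4, 2]])
regions = to_nested_regions(labels)  # channel order: ET, WT, TC
```

By construction every ET voxel is also a TC voxel, and every TC voxel is also a WT voxel, which is why the three output channels can be predicted with independent sigmoid activations.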

Our models were trained on the BraTS 2021 dataset with 1251 and 219 cases 
in the training and validation sets, respectively. Semantic segmentation labels 
corresponding to validation cases are not publicly available, and performance 
benchmarks were obtained by making submissions to the official server of the BraTS 
2021 challenge. We used a five-fold cross-validation scheme with a ratio of 80:20. 
We did not use any additional data. The final result was obtained with an 
ensemble of 10 Swin UNETR models to improve the performance and achieve 
a better consensus for all predictions. The ensemble models were obtained from 
two separate five-fold cross-validation training runs. 
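
A minimal sketch of the data split and the ensembling step is given below. How the ten model outputs are fused is not specified in the text; averaging the sigmoid probabilities and thresholding at 0.5 is an assumption, as are the function names.

```python
import numpy as np

def five_fold_splits(num_cases, rng):
    """Shuffle case indices and split them into five validation folds (80:20)."""
    return np.array_split(rng.permutation(num_cases), 5)

def ensemble_predict(probabilities, threshold=0.5):
    """Average per-model sigmoid outputs and threshold the mean.

    probabilities: array of shape (M, 3, H, W, D) for M models and the
    ET, WT and TC channels. Averaging and the threshold are assumptions.
    """
    mean_prob = np.mean(probabilities, axis=0)
    return (mean_prob >= threshold).astype(np.uint8)

rng = np.random.default_rng(0)
folds = five_fold_splits(1251, rng)         # 1251 training cases
probs = rng.uniform(size=(10, 3, 4, 4, 4))  # 10 models, toy 4 x 4 x 4 volume
mask = ensemble_predict(probs)
```

Each fold holds out roughly 250 of the 1251 cases for internal validation, so each of the two training runs contributes five models to the ensemble.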


4 Results and Discussion 


We have compared the performance of Swin UNETR in our internal cross-validation
split against the winning methodologies of previous years such as
SegResNet [30], nnU-Net [19] and TransBTS [38]. The latter is a ViT-based approach
which is tailored for the semantic segmentation of brain tumors.

Table 2. Five-fold cross-validation benchmarks in terms of mean Dice score values. ET,
WT and TC denote Enhancing Tumor, Whole Tumor and Tumor Core respectively.

           | Swin UNETR                    | nnU-Net                       | SegResNet                     | TransBTS
Dice Score | ET    | WT    | TC    | Avg.  | ET    | WT    | TC    | Avg.  | ET    | WT    | TC    | Avg.  | ET    | WT    | TC    | Avg.
Fold 1     | 0.876 | 0.929 | 0.914 | 0.906 | 0.866 | 0.921 | 0.902 | 0.896 | 0.867 | 0.924 | 0.907 | 0.899 | 0.856 | 0.910 | 0.897 | 0.883
Fold 2     | 0.908 | 0.938 | 0.919 | 0.921 | 0.899 | 0.933 | 0.919 | 0.917 | 0.900 | 0.933 | 0.915 | 0.916 | 0.885 | 0.919 | 0.903 | 0.902
Fold 3     | 0.891 | 0.931 | 0.919 | 0.913 | 0.886 | 0.929 | 0.914 | 0.910 | 0.884 | 0.927 | 0.917 | 0.909 | 0.866 | 0.903 | 0.898 | 0.889
Fold 4     | 0.890 | 0.937 | 0.920 | 0.915 | 0.886 | 0.927 | 0.914 | 0.909 | 0.888 | 0.921 | 0.916 | 0.908 | 0.868 | 0.910 | 0.901 | 0.893
Fold 5     | 0.891 | 0.934 | 0.917 | 0.914 | 0.880 | 0.929 | 0.917 | 0.909 | 0.878 | 0.930 | 0.912 | 0.906 | 0.867 | 0.915 | 0.893 | 0.892
Avg.       | 0.891 | 0.933 | 0.917 | 0.913 | 0.883 | 0.927 | 0.913 | 0.908 | 0.883 | 0.927 | 0.913 | 0.907 | 0.868 | 0.911 | 0.898 | 0.891


Table 3. BraTS 2021 validation dataset benchmarks in terms of mean Dice score and 
Hausdorff distance values. ET, WT and TC denote Enhancing Tumor, Whole Tumor 
and Tumor Core respectively. 


                   | Dice                  | Hausdorff (mm)
Validation dataset | ET    | WT    | TC    | ET    | WT    | TC
Swin UNETR         | 0.858 | 0.926 | 0.885 | 6.016 | 5.831 | 3.770


Evaluation results across all five folds are presented in Table 2. The proposed 
Swin UNETR model outperforms all competing approaches across all 5 folds 
and on average for all semantic classes (i.e., ET, WT and TC). Specifically, Swin 
UNETR outperforms the closest competing approaches by 0.7%, 0.6% and 0.4% 
for ET, WT and TC classes respectively and on average 0.5% across all classes in 
all folds. The superior performance of Swin UNETR in comparison to other top 
performing models for brain tumor segmentation is mainly due to its capability 
of learning multi-scale contextual information in its hierarchical encoder via the 
self-attention modules and effective modeling of the long-range dependencies. 

Moreover, it is observed that nnU-Net and SegResNet have competitive 
benchmarks in these experiments, with nnU-Net demonstrating a slightly better 
performance. On the other hand, TransBTS, which is a ViT-based methodology, 
performs sub-optimally in comparison to other models. The sub-optimal perfor- 
mance of TransBTS could be attributed to its inefficient architecture in which 
the ViT is only utilized in the bottleneck as a standalone attention module, and 
without any connection to the decoder in different resolutions. 

The segmentation performance of Swin UNETR on the BraTS 2021 validation 
set is presented in Table 3. According to the official challenge results³, our
benchmarks (Team: NVOptNet) are among the top-ranking methodologies
across more than 2000 submissions during the validation phase, making Swin UNETR
the first transformer-based model to place competitively in BraTS challenges. 
In addition, the segmentation outputs of Swin UNETR for several cases in the 
validation set are illustrated in Fig. 3. Consistent with quantitative benchmarks, 
the segmentation outputs are well-delineated for all three sub-regions. 


³ https://www.synapse.org/#!Synapse:syn25829067/wiki/612712.

Table 4. BraTS 2021 testing dataset benchmarks in terms of mean Dice score and 
Hausdorff distance values. ET, WT and TC denote Enhancing Tumor, Whole Tumor 
and Tumor Core respectively. 

                | Dice                  | Hausdorff (mm)
Testing dataset | ET    | WT    | TC    | ET     | WT    | TC
Swin UNETR      | 0.853 | 0.927 | 0.876 | 16.326 | 4.739 | 15.309


Furthermore, the segmentation performance of Swin UNETR on the BraTS 
2021 testing set is reported in Table 4. We observe that the Dice scores
for ET and WT are very similar to those of the validation benchmarks. 
However, the Dice score for TC decreases by 0.9%. 


5 Conclusion 


In this paper, we introduced Swin UNETR, a novel architecture for 
semantic segmentation of brain tumors using multi-modal MRI images. Our 
proposed model has a U-shaped network design and uses a Swin transformer 
as the encoder and a CNN-based decoder that is connected to the encoder via 
skip connections at different resolutions. We have validated the effectiveness of 
our approach in the BraTS 2021 challenge. Our model ranks among the top-performing
approaches in the validation phase and demonstrates competitive 
performance in the testing phase. We believe that Swin UNETR could be the 
foundation of a new class of transformer-based models with hierarchical encoders 
for the task of brain tumor segmentation. 


References 


1. Antonelli, M., et al.: The medical segmentation decathlon. arXiv preprint 
arXiv:2106.05735 (2021) 

2. Baid, U., et al.: The RSNA-ASNR-MICCAI brats 2021 benchmark on brain tumor 
segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314 
(2021) 

3. Bakas, S., et al.: Segmentation labels and radiomic features for the pre-operative 
scans of the TCGA-GBM collection. The Cancer Imaging Archive (2017). https:// 
doi.org/10.7937/K9/TCIA.2017.KLXWJJ1Q 

4. Bakas, S., et al.: Segmentation labels and radiomic features for the pre-operative 
scans of the TCGA-LGG collection. The Cancer Imaging Archive (2017). https:// 
doi.org/10.7937/K9/TCIA.2017.GJQ7ROEF 

5. Bakas, S., et al.: Advancing the cancer genome atlas glioma MRI collections with 
expert segmentation labels and radiomic features. Sci. Data 4, 1-13 (2017) 

6. Bakas, S., Reyes, M., et al., Menze, B.: Identifying the best machine learning
algorithms for brain tumor segmentation, progression assessment, and overall survival 
prediction in the BRATS challenge. arXiv preprint arXiv:1811.02629 (2018) 


7. Bakas, S., et al.: Identifying the best machine learning algorithms for brain tumor
segmentation, progression assessment, and overall survival prediction in the BRATS
challenge. arXiv preprint arXiv:1811.02629 (2018)

8. Bao, H., Dong, L., Wei, F.: BEiT: BERT pre-training of image transformers. arXiv
preprint arXiv:2106.08254 (2021)

9. Caron, M., et al.: Emerging properties in self-supervised vision transformers.
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
(2021)

10. Chen, C., Liu, X., Ding, M., Zheng, J., Li, J.: 3D dilated multi-fiber network for
real-time brain tumor segmentation in MRI. In: Shen, D., et al. (eds.) MICCAI
2019. LNCS, vol. 11766, pp. 184-192. Springer, Cham (2019).
https://doi.org/10.1007/978-3-030-32248-9_21

11. Chen, J., et al.: TransUNet: transformers make strong encoders for medical image
segmentation. arXiv preprint arXiv:2102.04306 (2021)

12. Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net:
learning dense volumetric segmentation from sparse annotation. In: Ourselin, S.,
Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS,
vol. 9901, pp. 424-432. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-46723-8_49

13. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep
bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
(2018)

14. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image
recognition at scale. In: International Conference on Learning Representations
(2020)

15. Grover, V.P., Tognarelli, J.M., Crossey, M.M., Cox, I.J., Taylor-Robinson, S.D.,
McPhail, M.J.: Magnetic resonance imaging: principles and techniques: lessons for
clinicians. J. Clin. Exp. Hepatol. 5(3), 246-255 (2015)

16. Hatamizadeh, A., et al.: UNETR: transformers for 3D medical image segmentation.
arXiv preprint arXiv:2103.10504 (2021)

17. Hoover, J.M., Morris, J.M., Meyer, F.B.: Use of preoperative magnetic resonance
imaging T1 and T2 sequences to determine intraoperative meningioma consistency.
Surg. Neurol. Int. 2, 142 (2011)

18. Huo, Y., et al.: 3D whole brain segmentation using spatially localized atlas network
tiles. Neuroimage 194, 105-119 (2019)

19. Isensee, F., Jäger, P.F., Full, P.M., Vollmuth, P., Maier-Hein, K.H.: nnU-Net for
brain tumor segmentation. In: Crimi, A., Bakas, S. (eds.) BrainLes 2020. LNCS,
vol. 12659, pp. 118-132. Springer, Cham (2021).
https://doi.org/10.1007/978-3-030-72087-2_11

20. Jiang, Z., Ding, C., Liu, M., Tao, D.: Two-stage cascaded U-Net: 1st place solution
to BraTS challenge 2019 segmentation task. In: Crimi, A., Bakas, S. (eds.) BrainLes
2019. LNCS, vol. 11992, pp. 231-241. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-46640-4_22

21. Kamnitsas, K., et al.: Ensembles of multiple models and architectures for robust
brain tumour segmentation. In: Crimi, A., Bakas, S., Kuijf, H., Menze, B., Reyes,
M. (eds.) BrainLes 2017. LNCS, vol. 10670, pp. 450-462. Springer, Cham (2018).
https://doi.org/10.1007/978-3-319-75238-9_38

22. Kamnitsas, K., et al.: Efficient multi-scale 3D CNN with fully connected CRF for
accurate brain lesion segmentation. Med. Image Anal. 36, 61-78 (2017)


23. Liu, D., Zhang, H., Zhao, M., Yu, X., Yao, S., Zhou, W.: Brain tumor segmentation
based on dilated convolution refine networks. In: 2018 IEEE 16th International
Conference on Software Engineering Research, Management and Applications
(SERA), pp. 113-120. IEEE (2018)

24. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin
transformer: hierarchical vision transformer using shifted windows. In: Proceedings
of the IEEE/CVF International Conference on Computer Vision (2021)

25. Liu, Z., et al.: Video swin transformer. arXiv preprint arXiv:2106.13230 (2021)

26. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 3431-3440 (2015)

27. Louis, D.N., et al.: The 2007 WHO classification of tumours of the central nervous
system. Acta Neuropathol. 114(2), 97-109 (2007)

28. Menze, B.H., et al.: The multimodal brain tumor image segmentation benchmark
(BRATS). IEEE Trans. Med. Imaging 34(10), 1993-2024 (2015)

29. Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: fully convolutional neural networks
for volumetric medical image segmentation. In: 2016 Fourth International Conference
on 3D Vision (3DV), pp. 565-571. IEEE (2016)

30. Myronenko, A.: 3D MRI brain tumor segmentation using autoencoder regularization.
In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F., Reyes, M., van Walsum,
T. (eds.) BrainLes 2018. LNCS, vol. 11384, pp. 311-320. Springer, Cham (2019).
https://doi.org/10.1007/978-3-030-11726-9_28

31. Myronenko, A., Hatamizadeh, A.: Robust semantic segmentation of brain tumor
regions from 3D MRIs. In: Crimi, A., Bakas, S. (eds.) BrainLes 2019. LNCS,
vol. 11993, pp. 82-89. Springer, Cham (2020).
https://doi.org/10.1007/978-3-030-46643-5_8

32. Nie, D., Zhang, H., Adeli, E., Liu, L., Shen, D.: 3D deep learning for multi-modal
imaging-guided survival time prediction of brain tumor patients. In: Ourselin, S.,
Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) MICCAI 2016. LNCS,
vol. 9901, pp. 212-220. Springer, Cham (2016).
https://doi.org/10.1007/978-3-319-46723-8_25

33. Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision
transformers see like convolutional neural networks? Adv. Neural. Inf. Process.
Syst. 34, 12116-12128 (2021)

34. Simpson, A.L., et al.: A large annotated medical image dataset for the development
and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063 (2019)

35. Tang, Y., et al.: Self-supervised pre-training of swin transformers for 3D medical
image analysis. arXiv preprint arXiv:2111.14791 (2021)

36. Ulyanov, D., Vedaldi, A., Lempitsky, V.: Instance normalization: the missing
ingredient for fast stylization. arXiv preprint arXiv:1607.08022 (2016)

37. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information
Processing Systems, pp. 5998-6008 (2017)

38. Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., Li, J.: TransBTS: multimodal brain
tumor segmentation using transformer. In: de Bruijne, M., et al. (eds.) MICCAI
2021. LNCS, vol. 12901, pp. 109-119. Springer, Cham (2021).
https://doi.org/10.1007/978-3-030-87193-2_11

39. Xie, Y., Zhang, J., Shen, C., Xia, Y.: CoTr: efficiently bridging CNN and transformer
for 3D medical image segmentation. arXiv preprint arXiv:2103.03024 (2021)

40. Zacharaki, E.I., et al.: Classification of brain tumor type and grade using MRI
texture and shape in a machine learning scheme. Magnetic Resonance Med. Off.
J. Int. Soc. Magnetic Resonan. Med. 62(6), 1609-1618 (2009)


41. Zheng, S., et al.: Rethinking semantic segmentation from a sequence-to-sequence
perspective with transformers. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pp. 6881-6890 (2021)

42. Zhou, C., Chen, S., Ding, C., Tao, D.: Learning contextual and attentive information
for brain tumor segmentation. In: Crimi, A., Bakas, S., Kuijf, H., Keyvan, F.,
Reyes, M., van Walsum, T. (eds.) BrainLes 2018. LNCS, vol. 11384, pp. 497-507.
Springer, Cham (2019). https://doi.org/10.1007/978-3-030-11726-9_44



Multi-plane UNet++ Ensemble 
for Glioblastoma Segmentation 


Johannes Roth¹, Johannes Keller², Stefan Franke², Thomas Neumuth²,
and Daniel Schneider²(✉)


¹ Center for Scalable Data Analytics and Artificial Intelligence (ScaDS.AI),
University of Leipzig, Leipzig, Germany
² Innovation Center Computer Assisted Surgery (ICCAS), University of Leipzig,
Leipzig, Germany
daniel.schneider@uni-leipzig.de


Abstract. Glioblastoma multiforme (grade four glioma, GBM) is the
most aggressive malignant tumor in the brain and is usually treated by
combined surgery, chemo- and radiotherapy. The O-6-methylguanine-DNA
methyltransferase (MGMT) promoter methylation status was
shown to be predictive of GBM sensitivity to alkylating agent chemotherapy
and is a promising marker for personalized treatment. In this paper
we propose to use a multi-plane ensemble of UNet++ models for the
segmentation of gliomas in MRI scans, using a combination of Dice loss
and boundary loss for training. For the prediction of MGMT promoter
methylation, we use an ensemble of 3D EfficientNets (one per MRI modality).
Both the UNet++ ensemble and the EfficientNet ensemble are trained and
validated on data provided in the context of the Brain Tumor Segmentation
Challenge (BraTS) 2021, containing 2,000 fully annotated glioma
samples with four different MRI modalities. We achieve Dice scores of
0.792, 0.835, and 0.906 as well as Hausdorff distances of 16.61, 10.11, and
4.54 for enhancing tumor, tumor core and whole tumor, respectively. For
MGMT promoter methylation status prediction, an AUROC of 0.577 is
obtained.


Keywords: Medical image segmentation · Ensemble learning ·
Glioma · MGMT promoter methylation


1 Introduction 


Gliomas comprise roughly 80% of all malignant brain tumors [7]. Particularly the 
grade four glioma, referred to as glioblastoma multiforme (GBM), indicates poor 
medical prognosis. GBM are usually treated with combined surgery, radiotherapy 
and chemotherapy. Treatment is often complicated by the strong morphological 
and histological heterogeneity of gliomas, consisting of distinct regions such as 
active tumor, cystic and necrotic structures, and edema/invasion. Automated 
and accurate methods for semantic segmentation of gliomas from multiparametric
magnetic resonance imaging (mpMRI) scans are critical to diagnosis and
therapy. In recent years, genomic studies identified molecular glioma subtypes
exhibiting correlation to prognosis and treatment response. For example, it was shown that
the O-6-methylguanine-DNA methyltransferase (MGMT) promoter methylation
status is predictive of GBM sensitivity to alkylating agent chemotherapy [18].
Molecular genetic markers may lead to more specialized and personalized treat- 
ment of glioma patients. The field of radiomics aims to predict similar disease 
characteristics via automated feature extraction from medical images. 

In the context of the brain tumor segmentation challenge (BraTS), a large- 
scale mpMRI dataset of patients with glioma is provided annually to evaluate 
state-of-the-art methods for automatic tumor segmentation and classification [1-4,13].
Specifically, the challenge consists of two tasks: the accurate segmentation 
of gliomas into the three subregions enhancing tumor, tumor core and whole 
tumor and the prediction of the MGMT promoter methylation marker from 
mpMRI scans. 

In this paper, we present image processing pipelines for both the segmentation
and classification tasks. For segmentation, we propose a multi-plane UNet++ 
[19] ensemble with a combination of Dice and boundary loss for accurate tumor 
border prediction. Fully convolutional networks such as UNet [14] are the current 
method of choice for medical image segmentation. Their hierarchical encoder- 
decoder structure captures spatial context in the input images and produces 
high resolution segmentation masks. The UNet++ considered in this work uses 
nested dense skip pathways instead of the vanilla skip connections, increasing 
semantic similarity between the encoder and decoder feature maps. Aggregating 
the output of ensembles of deep neural networks is a common technique shown to 
increase performance in various prediction tasks [8]. For classification, we use a 
3D EfficientNet [16] ensemble consisting of four models - one per MRI modality. 
The EfficientNet architecture enables the training of lightweight classification 
models without loss in performance compared to larger models such as ResNet 
[9]. 

The work at hand is structured as follows: Sect. 2 describes the dataset used 
for training and validation, as well as the details of our model ensembles and 
training procedures. In Sect. 3 we present the preliminary results of our methods 
on the provided test datasets. Finally, in Sect. 4 we draw conclusions and 
provide ideas for future research. 


2 Methods 


2.1 Data 


The data used for training and validation of the models presented in this paper 
is provided by the BraTS Challenge 2021 [1]. The data for the segmentation 
task contains 2,000 GBM cases, each providing four MRI modalities: T1-weighted
(T1), post-gadolinium T1-weighted (T1-Gd), T2-weighted (T2) and
T2-weighted fluid-attenuated inversion recovery (T2-FLAIR) (see Fig. 1 for example
slices).

Fig. 1. Example scans of the four MRI modalities T1, T1-Gd, T2 and T2-FLAIR (from
left to right) provided by the BraTS data.


The image data was preprocessed by co-registration to the same anatomical 
template, skull stripping and interpolation to a resolution of 1 mm. Each GBM 
sample was manually annotated by up to four raters. The annotations include 
the Gd-enhancing tumor (ET), the peritumoral edematous/invaded tissue (ED), 
and the necrotic tumor core (NCR). The union of ET and NCR is called tumor 
core (TC). Last, the whole tumor (WT) is comprised of the TC and ED regions. 
The classification task provides largely the same cases and the corresponding 
MGMT promoter methylation information, but without any preprocessing or 
information about the location of the tumor. 


2.2 Brain Glioblastoma Segmentation 


Fig. 2. Overview of the proposed segmentation pipeline. Each mpMRI scan is sliced
to create axial, sagittal and coronal 2D images. All four modalities are concatenated
and passed through the respective UNet++ model, producing a segmentation for each
slice. The resulting segmentation maps are then concatenated back into cubes, which
are then aggregated by a majority vote.


An overview of our segmentation pipeline is shown in Fig. 2. We trained an individual
UNet++ [19] on 2D MRI slices in each anatomical plane (axial, sagittal,
and coronal) to predict all four possible output classes. Similar to the vanilla 
UNet architecture, UNet++ consists of a pathway through hierarchical encoder 
and decoder subsections additionally linked by skip connections to retain infor- 
mation at higher spatial resolutions. To reduce the semantic gap between encoder 
and decoder, the skipping feature maps are gradually enriched by incorporating 
information from deeper layers through a number of nested convolutional blocks 
(see Fig. 3). 


Fig. 3. UNet++ architecture, featuring the typical UNet encoder and decoder pathway
and a series of nested dense skip connections, adapted from [19]. The legend
distinguishes up-sampling, down-sampling, skip connections and convolution blocks
$X^{i,j}$. In this work, supervision is only carried out on the output of the final
segmentation head.


For segmentation of a complete MRI scan, the image data is sliced and passed 
through the associated model. The resulting 2D segmentation maps are then 
concatenated back together and aggregated over the three models by a majority 
vote. As the backbone encoder for our UNet++ models, an Xception model [5] 
is used, taking all four MRI modalities as input. The encoder output is then 
passed through a UNet++ decoder with a softmax segmentation head. 
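
The slicing, per-plane prediction and majority vote can be sketched as follows. The code is an illustration only: `toy_model` is a hypothetical stand-in for a trained UNet++, the mapping of array axes to anatomical planes depends on the image orientation, and resolving ties in favor of the lowest class index is an assumption.

```python
import numpy as np

def slices_along(volume, axis):
    """Yield 2D multi-channel slices of a (C, X, Y, Z) volume along axis 1, 2 or 3."""
    for index in range(volume.shape[axis]):
        yield np.take(volume, index, axis=axis)

def predict_plane(volume, axis, model):
    """Segment every slice with `model` and stack the label maps back into a cube."""
    predictions = [model(s) for s in slices_along(volume, axis)]
    return np.stack(predictions, axis=axis - 1)

def majority_vote(cubes, num_classes=4):
    """Voxel-wise majority vote over label cubes; ties go to the lowest class index."""
    stacked = np.stack(cubes)
    votes = np.stack([(stacked == c).sum(axis=0) for c in range(num_classes)])
    return votes.argmax(axis=0)

def toy_model(slice_2d):
    """Hypothetical stand-in for a trained UNet++: class 1 where the channel sum is positive."""
    return (slice_2d.sum(axis=0) > 0).astype(np.int64)

rng = np.random.default_rng(0)
scan = rng.normal(size=(4, 6, 5, 7))  # four modalities, toy 6 x 5 x 7 volume
cubes = [predict_plane(scan, axis, toy_model) for axis in (1, 2, 3)]
segmentation = majority_vote(cubes)
```

In the actual pipeline a separate model is used for each plane, so the three cubes generally differ and the vote resolves their disagreements voxel by voxel.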

Our decoder consists of five final stages $X^{4,1}$–$X^{0,5}$ and multiple
corresponding intermediate stages. Each stage consists of two convolutional blocks, made
up by a convolutional layer with kernel size 3, followed by a batch normalization
layer and a ReLU activation. The number of feature maps of the convolutional
layers in each stage $j \in \mathbb{N}_{[1,5]}$ is $2^{j+3}$, i.e. 16, 32, 64, 128, and 256,
respectively. For a forward pass through $X^{i,j}$, the logits of $X^{i+1,j-1}$, upsampled via
nearest neighbour interpolation when necessary and concatenated with the
logits of intermediate $X^{i,k}, k \in \mathbb{N}_{(0,j)}$, are used as input. Segmentation masks are
obtained by passing the logits of $X^{0,5}$ through a softmax layer.
The model was trained to minimize the following loss function:

$$\mathcal{L} = (1 - \alpha)\,\mathcal{L}_{\mathrm{Dice}} + \alpha\,\mathcal{L}_{\mathrm{Boundary}},$$

where $\mathcal{L}_{\mathrm{Dice}}$ is the dice loss, $\mathcal{L}_{\mathrm{Boundary}}$ refers to the boundary loss [11], and $\alpha$
is a weight parameter. The binary dice loss is defined as

$$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i \in X} y_i p_i}{\sum_{i \in X} (y_i + p_i)},$$

where $i$ refers to the index over pixels provided $X$ is the set of all pixels in a slice,
while $y_i$ and $p_i$ indicate the corresponding true class label and the predicted class
softmax output, respectively. The boundary loss is used to improve segmentation
accuracy at the periphery of the different tumor regions. For binary class
segmentation it is defined as

$$\mathcal{L}_{\mathrm{Boundary}} = \sum_{i \in X} p_i f_S(i), \tag{2.2}$$



with $f_S(i): X \to \mathbb{R}$ being a pre-computed level set function encoding the
Euclidean distance of $i$ to the boundary of the compact region $S$ of the
positive target class. Equation (2.2) becomes minimal when the boundaries of the
ground truth and prediction region are aligned (for more details, see [11]). To
obtain the total multi-class loss, as necessary for the segmentation task, we use
the macro average of the losses for each tumor class in a one-versus-all
manner. We slowly shift the total loss towards the boundary loss by initializing $\alpha$
with 0.01 and then linearly increasing it by $\Delta\alpha = 0.01$ each epoch. During
preprocessing, we resized every slice to 256 × 256 pixels and used random image
transformations such as flipping, rotations, as well as Gaussian or Poisson noise
(with $\mu = 0, \sigma = 0.2$ for both) in order to mitigate overfitting. We trained our
models using the Adam optimizer [12] with a learning rate of 1e-4 and betas
(0.9, 0.999) for 50 epochs, using a batch size of 16. During training, the learning
rate was reduced by a factor of 0.1 whenever the validation loss stopped decreasing
for more than two epochs. For this purpose, a hold-out validation set comprising
20% of the available training data was used. Ultimately, the model with the best
validation score was used for inference.
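
To make the interplay of the two loss terms and the weight schedule concrete, the following NumPy sketch evaluates the combined loss for one slice. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the level set maps are taken as given inputs (assumed here to be negative inside the target region and positive outside), epochs are counted from zero, and the clamp of the weight at 1.0 is an added safeguard that the text does not mention.

```python
import numpy as np

def dice_loss(y, p, eps=1e-7):
    """Binary Dice loss: 1 - 2 * sum(y * p) / sum(y + p)."""
    return 1.0 - 2.0 * np.sum(y * p) / (np.sum(y + p) + eps)

def boundary_loss(p, level_set):
    """Boundary loss: predictions weighted by a pre-computed level set map."""
    return np.sum(p * level_set)

def alpha_at(epoch, alpha0=0.01, delta=0.01):
    """Loss weight after `epoch` epochs (0-based counting is an assumption)."""
    return min(alpha0 + delta * epoch, 1.0)  # clamp at 1.0 is an assumption

def total_loss(y, p, level_sets, alpha):
    """Macro average over classes of (1 - alpha) * Dice + alpha * Boundary.

    y, p, level_sets: arrays of shape (classes, height, width), one-vs-all.
    """
    per_class = [
        (1.0 - alpha) * dice_loss(y[c], p[c])
        + alpha * boundary_loss(p[c], level_sets[c])
        for c in range(y.shape[0])
    ]
    return float(np.mean(per_class))

# One class, one 2 x 2 slice, prediction identical to the ground truth.
y = np.array([[[1.0, 0.0], [0.0, 0.0]]])
level = np.array([[[-1.0, 1.0], [1.0, 2.0]]])
print(total_loss(y, y, level, alpha_at(0)))
```

With the stated schedule the weight only reaches about 0.5 after 50 epochs, so the Dice term never vanishes during training.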


2.3 Prediction of MGMT Promoter Methylation Status 


For the prediction of the MGMT promoter methylation marker a classifier ensem- 
ble was used (see Fig. 4 for a schematic representation of the pipeline). 

Since for this task the mpMRI data was not co-registered, we used each
modality independently to train a corresponding 3D EfficientNet [16]. The
models' architecture followed the 2D EfficientNet-B0 architecture with a width and
depth coefficient of 1.0 and a dropout probability of 0.2, but with 3D
convolutions to enable processing of complete MRI scans (see Table 1).

[Fig. 4 diagram: the T1, T1-Gd, T2, and T2-FLAIR MRI scans are each passed to a separate 3D EfficientNet; the four outputs are averaged to give the classification.]



Fig. 4. Overview of the proposed classification pipeline. Each MRI modality scan is 
passed into a 3D EfficientNet and scores are averaged to obtain the final class predic- 
tion. 


Table 1. EfficientNet-B0 architecture: Information flows through successive
convolutional stages in a feed-forward manner (increasing the level of feature abstraction while
reducing spatial resolution), then is aggregated via 1 × 1 convolution and pooling, and
finally processed by a fully connected classifier. Each row describes a stage in the
network with the number of layers and output channels. MBConv refers to the mobile
inverted bottleneck block from [15].

Stage | Operator | #Channels | #Layers
1 | Conv, 3 × 3 | 32 | 1
2 | MBConv1, 3 × 3 | 16 | 1
3 | MBConv6, 3 × 3 | 24 | 2
4 | MBConv6, 5 × 5 | 40 | 2
5 | MBConv6, 3 × 3 | 80 | 3
6 | MBConv6, 5 × 5 | 112 | 3
7 | MBConv6, 5 × 5 | 192 | 4
8 | MBConv6, 3 × 3 | 320 | 1
9 | Conv, 1 × 1 & Pooling & FC | 1280 | 1


For inference, we averaged over the class scores of all trained models. The
models are optimized using binary cross entropy loss

$$\mathcal{L}_{\mathrm{BCE}} = -\sum_{i=1}^{N} \left[ y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right],$$

where $N$ is the number of samples, $y_i \in \{0, 1\}$ denotes the class label of sample $i$ and
$p_i$ refers to the predicted output probability of the positive class. To obtain those class
pseudo-probabilities, we used a sigmoid on the model logits. Similarly to the
segmentation task, the models were trained using the Adam optimizer with a learning rate of
0.001 and betas (0.9, 0.999) for 10 epochs. During training, 20% of the available
data was held out for validation. For data augmentation, random rotations were
applied to the MRI scan volumes.
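
The score averaging and the loss can be illustrated with a few lines of NumPy. This is a sketch under assumptions, not the authors' code: the helper names are hypothetical, the ensemble is assumed to average sigmoid scores rather than raw logits, and the loss is summed over samples rather than averaged (a constant factor that does not change the minimizer).

```python
import numpy as np

def sigmoid(z):
    """Map logits to pseudo-probabilities of the positive class."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

def bce(y, p, eps=1e-7):
    """Binary cross entropy summed over all samples."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def ensemble_score(logits_per_modality):
    """Average the sigmoid scores of the per-modality models for one patient."""
    return float(np.mean(sigmoid(logits_per_modality)))

# Hypothetical logits from the T1, T1-Gd, T2 and T2-FLAIR models.
score = ensemble_score([0.4, -0.2, 1.1, 0.3])
print(score, score > 0.5)
```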


3 Results 


The final leaderboard evaluation was carried out with a hidden test set provided
by the BraTS 2021 challenge. Using the UNet++ ensemble for glioma
segmentation we achieved average dice scores of 0.792, 0.835 and 0.906 and average
95%-Hausdorff distances of 16.606, 10.115 and 4.549 for enhancing tumor, tumor
core and whole tumor, respectively. Table 2 shows all evaluation scores, averaged
over all samples in the hidden test set. Figure 5 features a few example
annotated MRI slices for qualitative assessment, while Fig. 6 shows exemplary
prediction failures. The MGMT promoter methylation classification model achieved
an AUROC score of 0.577 on the hidden test set.


Table 2. Preliminary evaluation scores for the UNet++ ensemble for enhancing tumor
(ET), tumor core (TC) and whole tumor (WT). The scores were computed on a hidden
test set provided by the BraTS 2021 challenge.

Metric | ET | TC | WT
Dice | 0.79185 | 0.83494 | 0.90638
95% Hausdorff distance | 16.60631 | 10.11467 | 4.53913
Sensitivity | 0.79693 | 0.80568 | 0.88333
Specificity | 0.99975 | 0.99984 | 0.99948


Ground truth Predicted segmentations False positives False negatives


Fig. 5. Glioma segmentation example slices obtained using the UNet++ ensemble. The
columns from left to right show the ground truth segmentations, predicted
segmentations, as well as the false positive and the false negative regions for Gd-enhancing tumor
(ET, blue), the peritumoral edematous/invaded tissue (ED, green), and the necrotic
tumor core (NCR, yellow). (Color figure online)


Ground truth Predicted segmentations False positives False negatives 


Fig. 6. Examples of low accuracy predictions during segmentation obtained using the 
UNet++ ensemble. 


4 Discussion and Conclusion 


This work presented our solutions to the tumor segmentation and classification
tasks of the BraTS 2021 challenge. For segmentation, we used an ensemble
consisting of three UNet++ models, one per anatomical plane, for the task of
segmenting GBM and their subregions in the human brain. The combination of
ensemble majority voting and training with boundary loss achieved fairly good
performance on the test data (average dice scores of 0.792, 0.835 and 0.906 for
enhancing tumor, tumor core and whole tumor) and thus turned out to be a valid approach for
automatic segmentation of GBM. For the classification task, specifically MGMT
promoter methylation marker prediction, we chose to omit any preprocessing
or feature extraction. Instead, we used an ensemble of 3D EfficientNets on the
raw MRI image series, one EfficientNet per MRI modality. Here, our approach
reached an AUROC score of 0.577 on the hidden test set.
Various means of improving the performance of our methods in both BraTS
tasks exist. The segmentation ensemble used could be expanded or its models
replaced with other FCN architectures, potentially including techniques such
as feature map competition or attention mechanisms. Deep supervision may be
used with the UNet++ models. Additionally, we plan to use boundary refinement
with a post-processing pipeline similar to BPR [17]. The observed segmentation
failures may suggest decreased performance for outlier cases. Here, advanced
techniques to improve generalization such as adversarial learning schemes may be
beneficial. For our approach to the classification task, segmentation of the tumor
regions and cropping during preprocessing is planned in future work. Also,
registration of the mpMRI modalities and subsequent use of multichannel input
to the EfficientNet might be of benefit. Most importantly, virtually no
hyperparameter tuning was conducted in this work; such tuning could potentially improve the
performance of both the segmentation and the classification task.

For adoption in clinical practice, automated methods should be able to pro- 
vide realistic estimates of their (un)certainty. While this work focuses on point 
estimation, the presented methods may be further extended by recent techniques 
for epistemic uncertainty estimation such as hypermodels [6] or ensemble knowl- 
edge distillation [10] as well as uncertainty calibration methods. 
Abstract. Segmentation of brain tumors is challenging because healthy or
background regions far outnumber tumor regions, and because the tumor
region itself is divided into edema, tumor core and non-enhancing
regions. The scarcity of such data makes the task even more
challenging. In this paper, we built a 3D-UNet based architecture
for the multimodal brain tumor segmentation task. We report
results on the BraTS 2021 validation and test datasets. We achieved Dice
values of 0.87, 0.76 and 0.73 on the whole tumor region, tumor core region and
enhancing part, respectively, on the validation data, and 0.73, 0.67 and 0.63
on the whole tumor region, tumor core region and enhancing part,
respectively, on the test data.


Keywords: Convolutional Neural Network · Unet · Brain Tumor ·
Segmentation · Magnetic Resonance Imaging


1 Introduction 


Gliomas are a very common type of tumor that develops in the brain. About 33%
of brain tumors are gliomas, which originate from the glial cells that surround and
support neurons in the brain. They are classified as either High Grade Gliomas
(HGG) or Low Grade Gliomas (LGG). High Grade Gliomas grow more aggressively
and lead to death. The tumor region itself comprises the sub-regions of
Gd-enhancing tumor, the peritumoral edematous/invaded tissue, and the necrotic
tumor core.

An automated method must be able to delineate the tumor region and
differentiate it from healthy tissue. However, because tumor regions vary
strongly in intensity, texture, appearance, location, etc., segmentation methods
need to account for these challenges [1,2].

Clinically, multiple image volumes are acquired of the brain, each
corresponding to one sequence. In general, four image sequences are
obtained (T1, T1-contrast enhanced, T2, and FLAIR), because certain components of
tumor regions are clearly visible only in certain image sequences.

The tumor consists of edema, which is made up of fluid and water and can
be best seen in the FLAIR and T2 modalities; necrosis (NCR), which is an accumulation
of dead cells and can be best seen in T1 contrast enhanced (T1ce); and enhancing tumor,
which indicates breakdown of the blood-brain barrier and can be seen clearly in T1ce.
The different modalities of brain scans vary in intensity.


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
A. Crimi and S. Bakas (Eds.): BrainLes 2021, LNCS 12962, pp. 295-305, 2022.
https://doi.org/10.1007/978-3-031-08999-2_24


296 G. Singh and A. Phophalia

The MICCAI BraTS Challenge has seen many methods in recent years, and
it aims at accurate segmentation of tumors [3]. UNet [4] based architectures
have been among the most successful, producing accurate
results for tumor segmentation. The best performing methods have used UNets
(an encoder-decoder framework) as their segmentation architecture [5-9]. Some
methods tried to leverage the advantages of both 3D and 2D architectures through
triplanar ensembles of CNNs [10].

In this paper, we built a 3D-UNet [4] based model for the brain tumor
segmentation task that leverages more contextual information while decoding
via a bottleneck layer at each encoder block's output.


2 Methods 


2.1 Data Pre-processing and Augmentation 


In this work, the original size of every patient’s MRI images was 240 × 240 × 155
with 4 modalities (Flair, T1, T1Gd, T2). We removed some background voxels
from each dimension of the MRI image and reduced it to a size of 160 × 192 × 128
around its center, on the assumption that only brain tissue is
then extracted. Intensity normalization was applied to each modality
while keeping the background at 0. We extracted a random patch of size
128 × 128 × 128 from every patient’s MRI images after combining the modalities
as channels [11].
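
The preprocessing chain above (center crop, per-modality normalization with zero background, channel stacking, random patch) can be sketched in NumPy as follows. The helper names are hypothetical, and the choice of z-score normalization over non-zero voxels is an assumption, since the text does not specify the normalization. Small toy volumes are used instead of the 240 × 240 × 155 scans to keep the example light.

```python
import numpy as np

def center_crop(volume, size):
    """Cut a block of shape `size` from the center of `volume`."""
    starts = [(d - s) // 2 for d, s in zip(volume.shape, size)]
    return volume[tuple(slice(a, a + s) for a, s in zip(starts, size))]

def normalize_foreground(volume, eps=1e-8):
    """Z-score the non-zero voxels (assumed scheme); background stays 0."""
    out = volume.astype(np.float32)
    brain = volume > 0
    if brain.any():
        out[brain] = (out[brain] - out[brain].mean()) / (out[brain].std() + eps)
    return out

def random_patch(stack, patch, rng):
    """Draw a random spatial patch from a (X, Y, Z, channels) array."""
    starts = [int(rng.integers(0, d - p + 1)) for d, p in zip(stack.shape[:3], patch)]
    return stack[tuple(slice(a, a + p) for a, p in zip(starts, patch))]

rng = np.random.default_rng(0)
# Toy stand-ins for the four modalities; the first 8 planes are background.
modalities = [rng.uniform(1.0, 100.0, size=(30, 30, 20)) for _ in range(4)]
for m in modalities:
    m[:8] = 0.0
cropped = [normalize_foreground(center_crop(m, (20, 24, 16))) for m in modalities]
stack = np.stack(cropped, axis=-1)            # modalities become channels
patch = random_patch(stack, (16, 16, 16), rng)
print(stack.shape, patch.shape)
```

For the sizes given in the text, the crop size would be (160, 192, 128) and the patch size (128, 128, 128).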

Following [12], we applied elastic deformation with a square deformation
grid, with displacements sampled from a normal distribution with a standard
deviation of 2 voxels, with a probability of 0.75.


2.2 Model Architecture 


UNet [4] is one of the most successful architectures in the medical domain.
It performs pixel-wise image segmentation using the convolutional
layers of the neural network.

In this work, we built a 3D-UNet based architecture with residual
connections [13], obtained by modifying the generator of Vox2Vox [11].
It concatenates the previous block's output with the current block's output in the
bottleneck layer (which forces the model to retain only the information useful for
reconstructing the segmentation map) and passes each encoder (Downward) output
through another Conv block (Horizontal1), as can be seen in
Fig. 1. This helps to refine each encoder block's output to produce a more
accurate segmentation.

Our model takes as input a 3D volume with the channels Flair, T1, T1ce, and T2,
which gives an input size of 128 × 128 × 128 × 4. Its output has the same size as the
input and contains the predicted segmentation, where each channel corresponds to one
of the labels (NCR, ED, ET, everything else).
Our model consists of the following blocks:


Downward: four down-sampling blocks, each consisting of a Conv3D with kernel
size 4 × 4 × 4 and stride 2, followed by Instance Norm and LeakyReLU with a
negative slope of 0.3.

Horizontal1: four horizontal blocks, each consisting of a Conv3D with kernel size
4 × 4 × 4, stride 1 and same padding, followed by Instance Norm and
LeakyReLU with a negative slope of 0.3.

Horizontal2: three horizontal blocks, each consisting of a Conv3D with kernel
size 4 × 4 × 4, stride 1 and same padding, followed by Instance Norm,
dropout with a rate of 0.2 and LeakyReLU with a negative slope of 0.3. The input of
each of these blocks is the concatenation of the current input and the previous output
from the horizontal layer.

Upward1: three upward blocks, each consisting of a ConvTranspose3D with kernel
size 4 × 4 × 4 and stride 2, followed by Instance Norm and LeakyReLU with a
negative slope of 0.3. Each of them is concatenated to the corresponding ‘Horizontal2’
block layer.

Upward2: one output block consisting of a ConvTranspose3D with kernel size
4 × 4 × 4 and stride 2, followed by Instance Norm and Softmax.


2.3 Training 


For brain tumor segmentation, we trained the network for 25 epochs. We
used the Adam optimizer (which combines an adaptive learning rate and gradient descent
with momentum) for our network with a learning rate of 0.00005,
$\beta_1 = 0.9$ and $\beta_2 = 0.999$.

For the loss, we used the Generalized Dice loss [14], which helps to deal with
the class imbalance that always occurs in the brain tumor segmentation
task, where the background region dominates over the tumor regions. This loss
counteracts the imbalance by penalizing errors on the majority class with a
lower weight and errors on the minority classes with a higher
weight. The weight of each class is given by the inverse of its volume. All
experiments were conducted on Google Colab Pro using a 16 GB NVIDIA
P100 GPU with 13.6 GB RAM.
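
The class weighting can be illustrated with a small NumPy sketch of a generalized Dice loss. It is an illustration, not the authors' implementation: the function name and array layout are assumptions, and the `power` parameter is added for this example. With `power=1` the weights are the inverse class volumes as stated in the text; the formulation in [14] is commonly given with the inverse squared volume, which corresponds to `power=2`.

```python
import numpy as np

def generalized_dice_loss(y, p, power=1, eps=1e-7):
    """Generalized Dice loss with per-class weights 1 / volume**power.

    y: one-hot ground truth, shape (classes, voxels)
    p: predicted probabilities, same shape
    """
    volume = y.sum(axis=1)                       # voxels per class
    weights = 1.0 / (volume ** power + eps)      # rare classes weigh more
    intersection = (weights * (y * p).sum(axis=1)).sum()
    denominator = (weights * (y + p).sum(axis=1)).sum()
    return float(1.0 - 2.0 * intersection / (denominator + eps))

# Two classes over four voxels: a large "background" and a small "tumor".
y = np.array([[1.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
print(generalized_dice_loss(y, y))        # perfect prediction
print(generalized_dice_loss(y, 1.0 - y))  # completely wrong prediction
```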

The entire network was trained from scratch, and we did not use any training
data other than the BraTS 2021 dataset [3, 15-18]. It took around 48 h to
completely train the network. We trained with a batch size of 8 and validated
with a batch size of 4. For training, we randomly split the BraTS training
dataset into two parts, 85% as the training set and 15% as the evaluation set, in
order to validate model performance in our own experiments.

[Fig. 1 diagram. Legend: Downward = Conv3D (k=4, s=2), Instance Norm, LeakyReLU(0.3); Upward1 = ConvTranspose3D (k=4, s=2), Instance Norm, LeakyReLU(0.3); Upward2 = ConvTranspose3D (k=4, s=2), Instance Norm, Softmax; Horizontal1 = Conv3D (k=4, s=1), Instance Norm, LeakyReLU(0.3); Horizontal2 = Conv3D (k=4, s=1), Instance Norm, Dropout(0.2), LeakyReLU(0.3); arrows denote concatenation; k: kernel size, s: stride.]

Fig. 1. Our Model Architecture


At inference time, we take a 160 × 192 × 128 region around the center of each
modality after removing background voxels from each dimension, which is
possible because the convolution operations are not affected by the
different input size, and then apply intensity normalization.
The input is then passed through the network to obtain a predicted segmentation of
the same size as the input (160 × 192 × 128 × 4), which we pad with zeros
to a size of 240 × 240 × 155.
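
The final zero-padding step can be sketched with `numpy.pad`. The helper name `pad_to` is hypothetical, and the example assumes that the 160 × 192 × 128 region was cut symmetrically from the center with offsets `(target - size) // 2`; the padding must mirror whatever offsets were actually used for cropping.

```python
import numpy as np

def pad_to(volume, target=(240, 240, 155)):
    """Zero-pad a centered crop back to the original scan size."""
    pads = []
    for dim, full in zip(volume.shape, target):
        before = (full - dim) // 2              # assumed crop offset
        pads.append((before, full - dim - before))
    return np.pad(volume, pads, mode="constant", constant_values=0)

# A label map of the cropped size, filled with ones for illustration.
prediction = np.ones((160, 192, 128), dtype=np.uint8)
restored = pad_to(prediction)
print(restored.shape)
```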


3 Results 


3.1 Dataset 


We used the BraTS 2021 dataset in this work. Each patient has 4 modalities,
namely i) T1, ii) contrast-enhanced T1-weighted (T1-Gd), iii) T2, and iv)
Fluid Attenuated Inversion Recovery (FLAIR), and an associated ground truth
label, each of size 240 × 240 × 155.

The sub-regions are: i) the “Enhancing Tumor” (ET), ii) the “Tumor
Core” (TC), and iii) the “Whole Tumor” (WT). The segmented class labels
are: 1 for NCR, 2 for ED (Edema), 4 for ET, and 0 for everything else. All
input scans are rigidly registered to the same anatomical atlas using the Greedy
diffeomorphic registration algorithm [19], ensuring a common spatial resolution
of 1 mm³. We have 1251 samples in the training set, 219 samples in the validation set
and 570 samples in the testing set for our experiment [3, 15-18].


3.2 Performance Analysis 


We report results based on the following metrics, as provided by the BraTS
Challenge:

(i) $\text{Dice Score} = \frac{2TP}{FN + FP + 2TP}$,

(ii) $\text{Sensitivity} = \frac{TP}{FN + TP}$,

(iii) $\text{Specificity} = \frac{TN}{FP + TN}$,

and (iv) the 95th percentile of the Hausdorff Distance (H95),

where FP, FN, TP, and TN are the numbers of false positive, false negative, true
positive and true negative voxels, respectively.
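
The three count-based metrics can be computed from binary masks as in the NumPy sketch below. The function names are hypothetical. The mapping from labels to evaluation regions (WT = labels 1, 2, 4; TC = labels 1, 4; ET = label 4) follows the usual BraTS convention and is an assumption here, since the text does not spell it out. Empty masks, which make a denominator zero, are not handled.

```python
import numpy as np

def region_masks(labels):
    """Binary masks for the evaluated regions (assumed BraTS convention)."""
    return {
        "WT": np.isin(labels, [1, 2, 4]),
        "TC": np.isin(labels, [1, 4]),
        "ET": labels == 4,
    }

def confusion_counts(pred, truth):
    """Voxel counts (TP, FP, FN, TN) for two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    tn = int(np.sum(~pred & ~truth))
    return tp, fp, fn, tn

def dice_score(tp, fp, fn):
    return 2 * tp / (fn + fp + 2 * tp)

def sensitivity(tp, fn):
    return tp / (fn + tp)

def specificity(tn, fp):
    return tn / (fp + tn)

pred = np.array([1, 1, 0, 0])
truth = np.array([1, 0, 1, 0])
tp, fp, fn, tn = confusion_counts(pred, truth)
print(dice_score(tp, fp, fn), sensitivity(tp, fn), specificity(tn, fp))
```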

Figure 2 compares the ground truth segmentation (top) with the predicted
segmentation from our model (bottom) on specific slices of the MRI images.
Figure 3 shows predicted segmentations from our model. Figure 4 shows a box
plot of the Dice coefficient score for each tumor region. Figure 5 shows a histogram
of the Dice coefficient for the whole tumor region on the BraTS validation data. Figure 6
shows the robust Hausdorff distance on the BraTS validation data. Figure 7 shows a
histogram of the sensitivity on the BraTS validation data. Figure 8 shows the
specificity on the BraTS validation data.

The mean Dice coefficient scores for the whole tumor, tumor core and
enhancing tumor regions on the validation data are 0.87, 0.76 and 0.73, respectively,
as can be seen in Table 1. The sensitivity and specificity of each tumor region on the
validation data are shown in Table 2. The mean Dice coefficient
scores for the whole tumor, tumor core and enhancing tumor regions on the test data
are 0.73, 0.67 and 0.63, respectively, as can be seen in Table 3. The sensitivity and
specificity of each tumor region on the test data are shown in Table 4.


Fig. 2. Ground truth segmentation (top) and predicted segmentation (bottom). Left to
right: patient BraTS2021_00000 on slice 80, patient BraTS2021_00003 on slice 100,
patient BraTS2021_00045 on slice 50, patient BraTS2021_00046 on slice 100.


Fig. 3. Predicted segmentation on the BraTS 2021 validation dataset. Left to right:
patient BraTS2021_00001 on slice 70, patient BraTS2021_00013 on slice 70,
patient BraTS2021_00015 on slice 90, patient BraTS2021_00027 on slice 80.

Fig. 4. The dice coefficient on BraTS 2021 validation data.

Fig. 5. The performance graph of dice coefficient on whole tumor region on BraTS
2021 validation data.

Fig. 6. Hausdorff distance on BraTS 2021 validation data.

Fig. 7. Sensitivity on BraTS 2021 validation data.

Fig. 8. Specificity on BraTS 2021 validation data.


Table 1. Dice Coefficient and Hausdorff distance on validation data

Label | Dice_ET | Dice_WT | Dice_TC | H95_ET | H95_WT | H95_TC
Mean | 0.73 | 0.87 | 0.76 | 30.50 | 6.29 | 14.70
StdDev | 0.27 | 0.09 | 0.28 | 93.57 | 10.29 | 50.74
Median | 0.83 | 0.90 | 0.89 | 2.23 | 3.31 | 3.16
25quantile | 0.73 | 0.85 | 0.74 | 1.41 | 2.23 | 1.73
75quantile | 0.89 | 0.93 | 0.93 | 5.0 | 6.0 | 8.77


Table 2. Sensitivity and specificity on validation data

Label | Sens_ET | Sens_WT | Sens_TC | Spec_ET | Spec_WT | Spec_TC
Mean | 0.70 | 0.86 | 0.72 | 0.99 | 0.99 | 0.99
StdDev | 0.26 | 0.12 | 0.28 | 0.0003 | 0.0008 | 0.0003
Median | 0.80 | 0.90 | 0.85 | 0.99 | 0.99 | 0.99
25quantile | 0.66 | 0.81 | 0.65 | 0.99 | 0.99 | 0.99
75quantile | 0.87 | 0.95 | 0.91 | 0.99 | 0.99 | 0.99


Table 3. Dice Coefficient and Hausdorff distance on test data

Label | Dice_ET | Dice_WT | Dice_TC | H95_ET | H95_WT | H95_TC
Mean | 0.65 | 0.73 | 0.69 | 72.69 | 63.26 | 72.79
StdDev | 0.34 | 0.33 | 0.37 | 143.42 | 132.37 | 141.81
Median | 0.83 | 0.88 | 0.89 | 2.23 | 4.30 | 3.60
25quantile | 0.58 | 0.75 | 0.61 | 1.41 | 2.44 | 1.73
75quantile | 0.90 | 0.92 | 0.94 | 11.07 | 12.24 | 16.79


Table 4. Sensitivity and Specificity on Test Data

Label | Sens_ET | Sens_WT | Sens_TC | Spec_ET | Spec_WT | Spec_TC
Mean | 0.63 | 0.71 | 0.67 | 0.84 | 0.84 | 0.84
StdDev | 0.34 | 0.33 | 0.36 | 0.35 | 0.35 | 0.35
Median | 0.80 | 0.87 | 0.86 | 0.99 | 0.99 | 0.99
25quantile | 0.52 | 0.68 | 0.52 | 0.99 | 0.99 | 0.99
75quantile | 0.88 | 0.93 | 0.92 | 0.99 | 0.99 | 0.99

In this paper, we built a 3D-UNet based architecture that exploits more
contextual information to produce the segmentation map for the multimodal brain tumor
segmentation task. Our model achieves mean Dice coefficients for the whole
tumor, tumor core and enhancing part of 0.87, 0.76 and 0.73, respectively, on the
validation set and 0.73, 0.67 and 0.63, respectively, on the test set.

In future work, ensembling models trained on different subsets of the data may
yield better results, and post-processing can be applied to remove small-volume
class labels in order to decrease false positives.

Abstract. Patient MGMT (O⁶-methylguanine-DNA methyltransferase)
status has been identified as essential for the responsiveness to
chemotherapy in glioblastoma patients and therefore represents an important clinical
factor. Testing for MGMT methylation is invasive, time consuming and
costly, and lacks a uniform gold standard. We studied MGMT status
assessment by multi-parametric magnetic resonance imaging (mpMRI)
scans and tested the ability of deep learning to perform this classification task.
To overcome the limited number of training examples, we used a transfer
learning approach based on the video clip classification network C3D [30],
allowing for full exploitation of three dimensional information in the MR
images. MRI sequences were fused using a locally connected layer. Our
approach was able to differentiate MGMT methylated from unmethylated
patients with an area under the receiver operating characteristics
curve (AUC) of 0.689 for the public validation set. On the private test
set, the AUC was 0.577. Further studies for assessment of clinical
importance and predictive power in terms of survival are needed.


Keywords: MGMT status · Glioblastoma · Transfer learning · Deep
learning


1 Introduction 


Glioblastoma (GB) represents a very aggressive form of malignant brain tumor
with a relative 5-year survival rate of less than 8% [21]. The standard therapy
approach includes surgery followed by radiotherapy, supplemented by concurrent
and adjuvant chemotherapy with an alkylating agent, i.e. temozolomide (TMZ)
[26].

TMZ leads to disruption of DNA replication by addition of a methyl group
to the O⁶ position of guanine, ultimately resulting in apoptosis. However, the
MGMT gene encodes a DNA repair protein that is able to remove alkyl groups,
which inhibits the effects of TMZ [27]. Therefore, high levels of MGMT are an
important determinant of treatment failure, making MGMT status an essential
clinical factor [8].

MGMT status is typically determined using tissue sample based polymerase
chain reaction (PCR) methods, but Han and Kamdar [7] have demonstrated the
ability of deep learning models to predict patients' MGMT status based on
multi-parametric magnetic resonance imaging (mpMRI) scans, allowing for
non-invasive and fast testing. The task of the Brain Tumor Segmentation (BraTS)
Challenge 2021 [1-4,20] was the development of such an mpMRI scan based MGMT
promoter methylation status prediction for glioblastoma patients.

The training data set of the challenge involved 585 independent patients, with 
information about four different MRI sequences. Deep learning models typically 
require data sets of larger size. We tested the ability of transfer learning to 
overcome this need for large data set sizes by following the approach developed 
previously [15]. The video clip classification network C3D [30] was used as a 
feature extractor. Video data is available in large data set sizes and has the 
same three dimensional structure as MR images, with the third dimension being 
time. This allows for full exploitation of three dimensional information in the MR 
images. C3D processes its input data by 3D convolutional layers, i.e. handling 
all three dimensions in the same manner, which makes it a perfect fit as baseline 
model used for feature extraction. Feature vectors of the different MRI sequences 
were fused using a locally connected layer. 


2 Material and Methods 


The data set included three cohorts: training, validation and testing. The
training cohort involved 585 cases with available mpMRI scans and MGMT status.
For the 87 validation cohort cases, only mpMRI scans were publicly accessible,
and the testing set was completely hidden. Data acquisition involved multiple
institutions, scanners and imaging protocols [1].

MRI sequences were given in the form of fluid-attenuated inversion recovery
(FLAIR), T1 weighted with contrast enhancement (T1wCE), T1 weighted (T1w)
and T2 weighted (T2w) acquisitions. Not all sequences were available for all
patients; for missing sequences, arrays filled with zeros were used.

MGMT status was given as a binary label (methylated vs. unmethylated)
with testing performed based on different assays including pyrosequencing
and next generation quantitative bisulfite sequencing of promoter
cytosine-phosphate-guanine sites [1]. The training set contained 307 methylated
and 278 unmethylated cases.


2.1 Preprocessing


We performed a stratified split, based on patient MGMT status, to separate
the training cohort into a train set of 497 cases and a tuning set of 88 cases.
Re-orientation to the LPS (Left-Posterior-Superior) coordinate system was
applied, all cases were resampled to a uniform voxel size of 1 mm × 1 mm × 3 mm,
a minimum image size of 126 mm × 126 mm × 150 mm was ensured using zero
padding, and voxel values $v_i$ were normalized following

$$\hat{v}_i = \frac{v_i - \mu}{\sigma} \times 255/8 + 255/3, \tag{1}$$

with $\mu$ and $\sigma$ the mean voxel value and standard deviation per image.

In order to identify regions in the images that contain air only, a binary
voxel-wise mask was generated based on MRI image voxel values using a threshold
value of 1:

$$m_i = \begin{cases} 1 & \text{if } v_i > 1 \\ 0 & \text{else} \end{cases} \tag{2}$$

with $m_i$ the voxel value of the mask and $v_i$ the respective MRI voxel value.

Images were then cropped based on bounding boxes defined by the binary 
mask, under consideration of the minimal image size mentioned before. Voxels 
lying outside the mask were set to zero. 
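
The masking and cropping steps can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: the function name `crop_to_mask` is hypothetical, the minimum size is given in voxels, and the way the bounding box is grown symmetrically to respect the minimum size is one possible reading of "under consideration of the minimal image size".

```python
import numpy as np

def crop_to_mask(image, min_shape, threshold=1.0):
    """Zero voxels at or below `threshold` and crop to the mask's bounding box.

    The box is grown (symmetrically where possible) so that it is at least
    `min_shape` voxels along each axis; this growth rule is an assumption.
    """
    mask = image > threshold
    masked = np.where(mask, image, 0)
    if not mask.any():
        return masked
    slices = []
    for axis, (dim, min_len) in enumerate(zip(image.shape, min_shape)):
        other_axes = tuple(a for a in range(image.ndim) if a != axis)
        hits = np.flatnonzero(mask.any(axis=other_axes))
        lo, hi = int(hits[0]), int(hits[-1]) + 1
        size = max(hi - lo, min(min_len, dim))
        # Shift the start back by half the deficit, staying inside the image.
        lo = max(0, min(lo - (size - (hi - lo)) // 2, dim - size))
        slices.append(slice(lo, lo + size))
    return masked[tuple(slices)]

image = np.zeros((10, 10, 10))
image[4:6, 4:6, 4:6] = 5.0   # "tissue"
image[0, 0, 0] = 0.5         # below the threshold, treated as air
cropped = crop_to_mask(image, min_shape=(4, 4, 4))
print(cropped.shape)
```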


2.2 Model 


Following the transfer learning approach [15], the video classification model C3D
[30] pretrained on the Sports-1M data set [12] was used as feature extractor. C3D
consists of 3D convolutional and max-pooling layers followed by dense layers; a
scheme can be seen in Fig. 1. Application of the C3D video classification model
as a feature extractor allows for full utilization of three-dimensional information
in the downstream task. This would not be possible for a model pretrained on
imaging data (e.g. ImageNet [6]), which could only be trained on slices of the
MR images.


[Fig. 1 diagram: Conv1a (64), Conv2a (128), Conv3a (256), Conv3b (256), Conv4a (512), Conv4b (512), Conv5a (512), Conv5b (512), interleaved with pooling layers and followed by two fully connected layers (4096 each).]


Fig. 1. C3D model, taken from Tran et al. [30]. Convolutional layers, denoted Conv,
feature kernels of size 3 × 3 × 3 and stride 1 for all dimensions, respective filter sizes are
shown in the image. Max-pooling layers, denoted Pool, feature kernels of size 2 × 2 × 2,
except for Pool1 with a kernel size of 1 × 2 × 2. Fully connected layers fc, highlighted
in gray, were removed from the network and weights of the convolutional layers were
kept fixed.


The advantage of using C3D instead of another video classification network
lies in the uniform handling of all dimensions in the input data by application
of 3D convolution and pooling layers. Usually, newer video classification models
handle the time dimension of the video data in a separate way (e.g. Xie et al.
[32]), which does not fit the structure of medical imaging data in the downstream
task.

The model was trained to classify video clips of the Sports-1M data set, 
containing 1.1 million videos of 487 sports activities. Weights of the trained 
C3D model are available online [29]. 

We removed all dense layers of the pretrained model and kept weights of the
convolutional layers fixed during training, i.e. no fine tuning was performed for
the convolutional layers. A feature vector $f^j$ for each image $j$ of the mpMRI
sequence was generated by passing it through the convolutional layers of the
C3D model. An input size of 112 × 112 × 48 voxels was chosen, resulting in feature
vectors of size 8192.

We then combined all feature vectors using a locally connected layer. Each
neuron $g_i$ of the locally connected layer was only connected to one neuron $f_i^j$
from each of the four feature vectors $f^{1-4}$ by

$$g_i = \sum_{j=1}^{4} f_i^j w_i^j + b_i, \tag{3}$$

with $w$ and $b$ denoting the weights and bias of the layer.
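
The fusion step amounts to a per-position weighted sum across the four feature vectors. The NumPy sketch below illustrates this; the function name and the array layout (one row per MRI sequence) are assumptions, and the weights here are random placeholders rather than trained values.

```python
import numpy as np

def locally_connected(features, weights, bias):
    """Fuse per-sequence feature vectors position by position.

    features: (4, D), one C3D feature vector per MRI sequence
    weights:  (4, D), one weight per sequence and feature position
    bias:     (D,)
    Output neuron i only sees entry i of each of the four vectors.
    """
    return (features * weights).sum(axis=0) + bias

rng = np.random.default_rng(0)
D = 8192                                   # feature vector size from the text
features = rng.normal(size=(4, D))
weights = rng.normal(size=(4, D))          # placeholder values
bias = np.zeros(D)
fused = locally_connected(features, weights, bias)
print(fused.shape)
```

Compared with a dense layer on the concatenated vectors, which would need 4·D·D weights, this layer has only 4·D weights and D biases, which is a natural choice when training data is limited.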

The locally connected layer was followed by dense layers of size 256 and 128, 
resulting in one output neuron. Dropout [25] with a probability of 0.5 followed 
by a ReLU activation layer was applied after the locally connected layer and 
all the dense layers. Dropout layers randomly set some of their neurons to zero
with a given probability; this technique has been shown to help prevent the
network from overfitting [11]. Sigmoid activation was used after the final output
neuron. A scheme of the model can be seen in Fig. 2.

During training, augmentation included flipping along the sagittal and
coronal planes, rotation by a multiple of 90°, and addition of Gaussian noise with
zero mean and a standard deviation of 5. Training cases were randomly cropped
to the desired input size; validation cases were center-cropped.
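
The difference between random cropping for training and center cropping for validation can be sketched as follows. The volume shape used in the example is hypothetical, and the helper names are illustrative only.

```python
import random


def crop_start(full_size, crop_size, training, rng=random):
    """Return the start index of a crop along one axis.

    Training crops start at a random valid offset; validation crops are centered.
    """
    if crop_size > full_size:
        raise ValueError("crop is larger than the image axis")
    slack = full_size - crop_size
    return rng.randint(0, slack) if training else slack // 2


def crop_box(image_shape, crop_shape, training, rng=random):
    """Return (start, stop) index pairs for every axis of the volume."""
    starts = [crop_start(f, c, training, rng) for f, c in zip(image_shape, crop_shape)]
    return [(s, s + c) for s, c in zip(starts, crop_shape)]


# Hypothetical 240 x 240 x 155 volume cropped to the 112 x 112 x 48 input size.
center_box = crop_box((240, 240, 155), (112, 112, 48), training=False)
random_box = crop_box((240, 240, 155), (112, 112, 48), training=True,
                      rng=random.Random(0))
```
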

The Adam optimizer [13] with a learning rate of $10^{-4}$ was used to optimize
the binary cross-entropy loss, and the batch size was 16. All models were trained
locally (on an Nvidia Tesla M60 graphics card) for 150 epochs. The best performing
models were chosen based on the minimal tuning set loss and selected to be
evaluated on the public validation set. Finally, the two best performing models,
based on the public validation set AUC score, were submitted to be evaluated
on the private test set.

Fig. 2. MGMT classification model. Feature extraction is performed using the
convolutional part of the pretrained C3D model. The four resulting feature vectors are
combined by the locally connected layer. The locally connected layer is followed by
dense layers ending in one output neuron. Dropout with a rate of 0.5 and ReLU
activation is applied after the locally connected layer and all dense layers. The final output
neuron is followed by a sigmoid activation layer.


3 Results 


The best performing model achieved a training and tuning loss of 0.638 (0.623-0.653)
and 0.649 (0.608-0.692), and an AUC score of 0.699 (0.660-0.737)
and 0.685 (0.589-0.781), respectively. Errors were computed using bootstrap resampling
with 10,000 samples and computation of the 5% and 95% percentiles. A receiver
operating characteristic curve for the tuning set is shown in Fig. 3. For a
threshold value of 0.5 in the sigmoid output layer, the network achieved a sensitivity
and specificity of 0.674 and 0.619 on the tuning set. Negative and positive
predictive values were 0.634 and 0.660.

Fig. 3. Tuning set results. Receiver operating characteristic curve with an area
under the curve of 0.685.

For the public validation set the algorithm achieved an AUC score of 0.689 
and performance on the final private test set was given by 0.577. 
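
The percentile bootstrap used for the error estimates can be sketched as below. This is a minimal illustration, not the original evaluation code: the rank-based AUC computation, the skipping of single-class resamples, the simple index-based percentile, and the toy labels and scores are all assumptions.

```python
import random


def auc_score(labels, scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_interval(labels, scores, n_boot=10000, lower=5.0, upper=95.0, seed=0):
    """Return the lower and upper percentile of the bootstrapped AUC."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample cases with replacement
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain one class only
            stats.append(auc_score(ys, [scores[i] for i in idx]))
    stats.sort()

    def pct(q):
        return stats[min(len(stats) - 1, int(q / 100.0 * len(stats)))]

    return pct(lower), pct(upper)


# Toy data, for illustration only.
toy_labels = [0, 0, 1, 1, 0, 1, 1, 0]
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.3, 0.7, 0.55, 0.6]
toy_auc = auc_score(toy_labels, toy_scores)
toy_low, toy_high = bootstrap_interval(toy_labels, toy_scores, n_boot=1000)
```
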


4 Discussion 


We presented a video data based transfer learning approach for classification 
of MGMT status in brain tumors based on mpMRI data. MRI sequences were 
processed by pretrained convolutional layers and then fused using a locally con- 
nected layer followed by dense layers. The network was able to discriminate 
MGMT methylated from unmethylated cases with an AUC of 0.685 and 0.689 
for the tuning and public validation sets, respectively. However, the final AUC score on the
private test set reached only 0.577.

Features of medical imaging data are affected by the application of differ- 
ent scanners and scan protocols [16,18]. Deep learning models are sensitive to 
such domain shifts between training and test set [9, 14,22]. Therefore, the inclu- 
sion of different image acquisition procedures can lead to strong performance 
drops [31]. Li et al. [17] trained a radiomics model for prediction of ATRX 
gene mutation status in lower-grade glioma patients and experienced a decline 
from 0.925 validation AUC to 0.725 when tested on external data. Hence, the 
multi-institutional property of the data set, involving several different scanners 
and imaging protocols, may explain the reduced performance on the test set. 
Furthermore, medical cohorts are typically several orders of magnitude smaller
than data sets usually encountered in the domain of deep learning. The
tuning/validation sets of the problem at hand involved 88/87 cases. For such data
set sizes, at least slight overfitting on the validation set is inevitable, which also leads
to a drop between tuning/validation and testing performance. However, an
in-depth analysis of the mechanisms causing the inferior predictive power on the test
set would require image acquisition information.

The fusion of different imaging modalities by a locally connected layer allowed 
for construction of a model with relatively small number of trainable weights. 
For general verification of applicability, the method has to be tested on other 
classification problems. 

Typical AUC scores reached by machine and deep learning models on the
task of MRI-based MGMT status determination range between 0.60 and
0.90 [5,7,23,33]. Tixier et al. [28] showed that the combination of mpMRI imaging
features obtained by radiomics analysis and patient MGMT status
stratifies patients into survival subcohorts better than MGMT status
alone. However, results of the BraTS Challenge 2021 demonstrated that improvements
in robustness are indispensable for successful MRI-based MGMT status
determination. For deep learning, transfer learning is known to improve robustness
[10], but no sufficient result could be achieved for the problem at hand.
Rebuffi et al. [24] showed that, when combined with model weight averaging,
data augmentation can also improve model robustness, but the method has to
be tested in the medical domain.

For clinical applicability, improvements in robustness have to be achieved 
and mechanisms leading to inferior performance on external data have to be 
identified. Current MGMT methylation status assays lack uniform methods and 
definitions, with no gold standard test at hand [19]. This prohibits direct com- 
parison with other testing methods. Determination of predictive power in terms 
of survival would be one possible way to circumvent this problem, but no ground 
truth survival data is available in the study data set. 


5 Conclusion 


We have tested the ability of video clip transfer learning in combination with
image sequence fusion by a locally connected layer for MGMT status prediction
in glioblastoma patients based on mpMRI data. Sufficient results could be
achieved on the public validation set; on the private test set, however, a drop in
performance was encountered. Mechanisms leading to the performance decline have to be
analyzed and model robustness has to be improved for clinical applicability. For
further verification, correlation with survival data is needed.


Abstract. In this paper, we propose a multimodal brain tumor segmentation method using
a 3D ResUNet deep neural network architecture. Deep neural networks have been
applied in many domains, including computer vision and natural language processing.
They have also been used for semantic segmentation in medical imaging,
including brain tumor segmentation. In this work, we utilize a 3D
ResUNet to segment tumors in brain magnetic resonance images (MRI). Multimodal
MRI is prevalent in brain tumor analysis because it provides rich tumor
information. We apply the proposed method to the Multimodal Brain Tumor Segmentation
Challenge (BraTS) 2021 validation dataset for tumor segmentation. The
online evaluation of brain tumor segmentation using the proposed method yields
a Dice score coefficient (DSC) of 0.8196, 0.9195, and 0.8503 for enhancing
tumor (ET), whole tumor (WT), and tumor core (TC), respectively.


Keywords: Deep neural network - Tumor segmentation - Multimodal MRIs 


1 Introduction 


Glioblastoma (GB) and diffuse astrocytic glioma with molecular features of GBM
(WHO IV astrocytoma) are the most common and aggressive malignant primary tumors
of the central nervous system (CNS), with extreme intrinsic heterogeneity in appearance,
shape, and histology [1]. Each year, 23 out of 100,000 people are diagnosed with CNS
brain tumors in the US [2]. According to the revised classification of CNS tumors by the World
Health Organization (WHO), brain tumors are classified by integrating
histology and molecular features, including glioblastoma, IDH-wildtype/-mutant,
diffuse astrocytoma, IDH-wildtype/-mutant, etc. [3]. It is believed that the survival period
of glioma patients is highly associated with tumor type [4]. Proper tumor classification
is helpful for tumor treatment management. However, the median survival period
of patients with glioblastoma (GBM) remains 12-16 months [5], even with modern
treatment advances. Brain tumor segmentation is important for brain tumor
prognosis, treatment planning, and follow-up evaluation. An accurate tumor segmentation
could lead to a better prognosis. Manual brain tumor segmentation by radiologists
is tedious, time-consuming, and prone to rater error [4]. Therefore, developing automatic
computer-aided brain tumor segmentation is highly desirable. Structural magnetic
resonance imaging (MRI) is widely used for brain tumor studies because it is
non-invasive and captures soft tissue well. Segmenting all types of tumors from a single
structural MRI sequence is very challenging because of imaging artifacts and
the complexity of different tumors. Multi-parametric MRI (mpMRI) offers complementary
information for different tumors. The mpMRI sequences include T1-weighted MRI
(T1), T1-weighted MRI with contrast enhancement (T1ce), T2-weighted MRI (T2), and
T2-weighted MRI with fluid-attenuated inversion recovery (T2-FLAIR). T1ce and
T2-FLAIR are usually considered good sources to identify enhancing tumor (ET)/necrosis
(NC) and peritumoral edema (ED), respectively.

There are many works on brain tumor segmentation in the literature. The proposed
methods include threshold-based, region-based, and conventional machine learning-based
methods [6-11]. However, threshold-based and region-based methods are
out of date because setting a proper threshold is very difficult, and these methods are
incapable of high-quality multi-tissue separation. Tumor segmentation can also be treated
as a classification problem. As such, conventional machine learning-based methods have
become popular for tumor classification. However, the prerequisite of hand-crafted feature
extraction and subsequent feature selection is very challenging for such methods. It
requires advanced knowledge of computer vision and a good understanding of radiology,
which limits their application. Recently, deep learning has attracted much attention because of
its success in many domains, such as computer vision [12] and medical imaging analysis
[13]. In contrast to conventional machine learning-based methods, deep learning-based
methods perform feature extraction and selection automatically
[12, 14-17]. In addition, deep learning-based methods are applicable to multiclass
problems.

In this work, we use a 3D ResUNet for brain tumor segmentation. The 3D ResUNet
architecture is composed of two parts, an encoding part and a decoding part. The encoding
part extracts high-dimensional convolutional features from the input. Conversely, the
decoding part transfers the extracted convolutional features to classification label maps.
The network weights are adjusted according to the loss between the classification label
maps and the corresponding ground truth until the loss reaches a small value or a defined
threshold.


2 Method 


2.1 Brain Tumor Segmentation 


For a high-grade glioma patient, a typical brain tumor comprises multiple tumor subtypes:
enhancing tumor (ET), non-enhancing tumor (NET), necrosis (NC), and peritumoral edema
(ED). However, it is difficult to distinguish NET from ED in clinical practice, even for
professional radiologists. These tumor subtypes have different appearances on mpMRI.
T2 and T2-FLAIR are mainly used for identifying ED because it shows a strong
signal, while the T1ce sequence is employed for distinguishing ET. Even with mpMRI,
identifying all tumor subtypes is still challenging due to many factors, such as imaging
artifacts, image acquisition quality, intensity inhomogeneity, etc. In general, deep
learning-based methods outperform traditional machine learning methods in many
applications, such as image semantic segmentation, face detection, etc. [18].

To achieve accurate brain tumor segmentation, we propose a 3D ResUNet deep
learning-based method. The proposed architecture is shown in Fig. 1. The 3D ResUNet
architecture consists of two parts: an encoding part and a decoding part. The encoding
part extracts high-dimensional convolutional features from the input, and the decoding
part conversely transfers the extracted convolutional features to segmentation label maps.
The loss computed between the label maps and the ground truth drives the weight
adjustment through an optimizer.


[Figure legend: Conv3d; Concatenate; Down sampling; Conv3d for up sampling; Up sampling; Add]


Fig. 1. The proposed ResUNet architecture. 


3 Materials and Pre-processing 


3.1 Data 


In the experiment, there are 1251 cases with mpMRI obtained from the Multimodal
Brain Tumor Segmentation Challenge 2021 (BraTS 2021) [5, 18-21]. Unlike
previous BraTS challenges, BraTS 2021 has the largest dataset so far, and there is
no indication of high-grade glioma (HGG)/low-grade glioma (LGG) information. Each
patient case contains multi-parametric MRI (mpMRI), including T1, T1ce, T2, and
T2-FLAIR. These clinically acquired mpMRI scans are co-registered, skull-stripped, and
denoised [20]. Each image has a uniform size of 240 x 240 x 155 across cases. A typical
brain tumor of HGG cases has multiple tumor subtypes: necrosis (NC), peritumoral
edema (ED), and enhancing tumor (ET). The ground truth of the training data is public for
all participants. However, the ground truths of the validation and testing data are held privately
by the challenge organizer and are not available to participants. The participants
are allowed to submit segmentation results online multiple times through the Synapse
Evaluation Platform to evaluate their methods. The evaluation of
BraTS 2021 is based on three tumor sub-regions and the Hausdorff distance. The three
tumor sub-regions are enhancing tumor (ET), tumor core (TC), and whole tumor (WT). TC is
the combination of ET and NC, while WT comprises all abnormal tissues. In the validation
phase, there are 219 cases with the same image format and types as the training data.
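
The mapping from tissue labels to the three evaluated sub-regions can be sketched as follows. The integer label encoding (1 = NC, 2 = ED, 4 = ET) is an assumption based on the usual BraTS convention and is not stated in the text; the function name is illustrative.

```python
# Assumed label encoding (not stated in the text): 0 = background, 1 = NC, 2 = ED, 4 = ET.
NC, ED, ET = 1, 2, 4


def to_regions(label_map):
    """Convert a flat list of voxel labels into binary ET, TC, and WT masks."""
    et = [int(v == ET) for v in label_map]
    tc = [int(v in (ET, NC)) for v in label_map]      # tumor core = ET + NC
    wt = [int(v in (ET, NC, ED)) for v in label_map]  # whole tumor = all abnormal tissue
    return {"ET": et, "TC": tc, "WT": wt}


voxels = [0, 1, 2, 4, 2, 0]
regions = to_regions(voxels)
```

The regions are nested by construction: every ET voxel belongs to TC, and every TC voxel belongs to WT.
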


3.2 Pre-processing


Since the challenge data is acquired from multiple centers, the intensity scale could 
vary. Therefore, it is necessary to apply intensity normalization to minimize the impact 
of intensity variance across cases and modalities. There are several methods for intensity 
normalization. One popular method is z-score intensity normalization applied in brain 
regions in the mpMRIs. The z-score normalization ensures intensity with zero mean and 
unit standard deviation (std) [22]. In the experiment, we apply the z-score normalization 
for all cases. Figure 2 illustrates an example of image comparison before and after z-score 
normalization. 
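
A minimal sketch of z-score normalization restricted to brain voxels is given below. Setting voxels outside the brain mask to zero is an assumption, since the text does not state how background voxels are treated; the data are toy values.

```python
import math


def zscore_normalize(intensities, brain_mask):
    """Normalize intensities to zero mean and unit std, using brain voxels only.

    Voxels outside the mask are set to 0 (an assumption).
    """
    brain = [v for v, m in zip(intensities, brain_mask) if m]
    if not brain:
        raise ValueError("brain mask is empty")
    mean = sum(brain) / len(brain)
    std = math.sqrt(sum((v - mean) ** 2 for v in brain) / len(brain))
    if std == 0:
        raise ValueError("constant intensities cannot be z-scored")
    return [(v - mean) / std if m else 0.0 for v, m in zip(intensities, brain_mask)]


image = [0.0, 100.0, 200.0, 300.0, 0.0]
mask = [0, 1, 1, 1, 0]
normalized = zscore_normalize(image, mask)
```

Each modality of each case would be normalized separately in this scheme, so that intensity scales from different centers become comparable.
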


Fig. 2. An instance of intensity normalization. Top row: raw images; bottom row: images
normalized using z-score normalization. From left to right: T2-FLAIR, T1, T1ce, and T2.


4 Experiments and Results 


4.1 Hyper-parameter Setting 


All images in the experiment have a size of 240 x 240 x 155. Due to limited graphics
processing unit (GPU) resources, we randomly crop all mpMRI images to a size of
128 x 128 x 128 to fit the proposed deep neural network. To maximize the processed patch
size, we set the batch size to 1 for the proposed 3D ResUNet. The loss function is
computed using cross-entropy as follows:


$$L = -\left(y \log(p) + (1 - y) \log(1 - p)\right), \qquad (1)$$


where p and y are the class prediction and ground truth (GT), respectively.
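
As a numerical illustration of the cross-entropy in Eq. (1), the following sketch evaluates the loss for single voxels. The clipping constant `eps` is an assumed safeguard against log(0) and is not part of the formula in the text.

```python
import math


def binary_cross_entropy(p, y, eps=1e-7):
    """Eq. (1) for one voxel: p is the predicted probability, y the 0/1 ground truth."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); eps is an assumption
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_bce(probs, targets):
    """Average the per-voxel loss over a patch given as flat lists."""
    return sum(binary_cross_entropy(p, y) for p, y in zip(probs, targets)) / len(probs)


loss_confident = binary_cross_entropy(0.9, 1)  # prediction agrees with the ground truth
loss_wrong = binary_cross_entropy(0.1, 1)      # prediction disagrees with the ground truth
```

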
We set the number of training epochs to 200 and use the Adam [23] optimizer with an initial learning
rate of $lr_0 = 0.001$ in the training phase; the learning rate ($lr_i$) is gradually reduced:


$$lr_i = lr_0 \left(1 - \frac{i}{N}\right)^{0.9} \qquad (2)$$


where i is the epoch counter and N is the total number of epochs in training.
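
The learning rate decay can be written as a short function. This sketch assumes the schedule is the polynomial decay lr_i = lr_0 (1 - i/N)^0.9 with the stated initial rate of 0.001 and 200 epochs; the function name is illustrative.

```python
def poly_learning_rate(epoch, total_epochs=200, initial_lr=0.001, power=0.9):
    """Polynomial decay: lr_i = lr_0 * (1 - i / N) ** power."""
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    return initial_lr * (1.0 - epoch / total_epochs) ** power


# Learning rate at the start of every training epoch.
schedule = [poly_learning_rate(i) for i in range(200)]
```

The rate starts at the initial value and decreases monotonically towards zero as the epoch counter approaches N.
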


4.2 Measurement Metric 


In the experiment, there are two main measurement metrics: Dice similarity coefficient
(DSC) [24] and Hausdorff distance (HD). The DSC is computed as follows:


$$DSC = \frac{2TP}{FP + 2TP + FN}, \qquad (3)$$


where TP, FP, and FN are the numbers of true positives, false positives, and false negatives,
respectively. The HD measures the distance between the predicted segmentation and
the corresponding ground truth, as follows:


$$HD95 = \mathrm{percentile}\left(\max_{a \in pred} \min_{b \in gt} d(a, b),\ 95\right) \qquad (4)$$
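
Both metrics can be sketched in a few lines. The Dice function follows the TP/FP/FN definition directly. For the Hausdorff distance, the sketch uses a common reading of the 95th-percentile variant (the 95th percentile of nearest-neighbour distances, taken in both directions), which is an assumption about the intended definition; it works by brute force on point coordinates and is meant for illustration, not for full-size volumes.

```python
import math


def dice(pred, gt):
    """DSC = 2TP / (FP + 2TP + FN) for two binary masks given as flat lists."""
    tp = sum(1 for p, g in zip(pred, gt) if p and g)
    fp = sum(1 for p, g in zip(pred, gt) if p and not g)
    fn = sum(1 for p, g in zip(pred, gt) if not p and g)
    denom = fp + 2 * tp + fn
    return 2 * tp / denom if denom else 1.0  # both masks empty: treated as perfect (assumption)


def percentile(values, q):
    """Linear-interpolation percentile of a non-empty list (q in [0, 100])."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def directed_hd95(points_a, points_b):
    """95th percentile of the distances from each point in A to its nearest point in B."""
    nearest = [min(math.dist(a, b) for b in points_b) for a in points_a]
    return percentile(nearest, 95)


def hd95(pred_points, gt_points):
    """Symmetric variant: the larger of the two directed 95th-percentile distances."""
    return max(directed_hd95(pred_points, gt_points),
               directed_hd95(gt_points, pred_points))
```

For example, `dice([1, 1, 0, 0], [1, 0, 1, 0])` has one true positive, one false positive, and one false negative, so the score is 2 / 4 = 0.5.
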


4.3 Tumor Segmentation 


For the brain tumor segmentation task, we utilize a 5-fold cross-validation scheme to
train models. Figure 3 shows a case with segmentation using the proposed method in
multiple views. In the image, green, blue, and yellow represent necrosis (NC),
enhancing tumor (ET), and edema (ED), respectively. Figure 4 shows two further examples with
the complete multimodal images and segmentations generated by the proposed method.

Fig. 3. An example of tumor segmentation using the proposed method. From left to right: T1ce
overlaid with predicted segmentation in axial view, sagittal view, and coronal view, respectively.
Color code: green, blue, and yellow represent NC, ET, and ED, respectively.


[Figure panels: columns ID, T1, T1ce, T2-FLAIR, Prediction; rows BraTS2021_00001 and
BraTS2021_00013; legend: Edema, Necrosis, Enhancing tumor]


Fig. 4. Two cases of BraTS2021. Each case has four image modalities (from left to right): T1,
T1ce, T2, and T2-FLAIR. The predicted label using our deep learning model is shown in the last
column. Color code on the predicted label: yellow, green, and blue represent edema, necrosis,
and enhancing tumor, respectively.


4.4 Online Evaluation 


After obtaining models from the training phase, we apply the trained models
to the BraTS 2021 validation dataset and evaluate the performance through the online
portal. There are 219 cases with unknown tumor grade. The online evaluation of our
segmentation achieves an average DSC of 0.8196, 0.9195, and 0.8503 for ET, WT, and TC,
respectively. The Hausdorff distance (HD), a metric measuring the distance between
segmentation and ground truth, is also provided by the online evaluation. A smaller
HD indicates a better segmentation. The average 95th-percentile HD is 17.89 mm,
4.3 mm, and 9.89 mm for ET, WT, and TC, respectively.


Table 1. Brain tumor segmentation performance using the online evaluation of BraTS 2021 
validation and testing dataset. 


Phase       Dice_ET  Dice_WT  Dice_TC  Hausdorff95_ET  Hausdorff95_WT  Hausdorff95_TC
Validation  0.8196   0.899    0.8503   17.89           4.3             9.89


The online evaluation shows that the proposed method performs well, with high
DSC and low HD in the validation phase. The Hausdorff distance for ET is
smaller in the testing phase than in the validation phase.


4.5 Online Testing Evaluation 


To participate in the BraTS 2021 challenge, instead of submitting the segmentation results,
all participants are required to submit the models/methods wrapped with Docker via the
online submission portal. The challenge organizer applies the models/methods to the
testing data to evaluate the performance.


Phase       Dice_ET  Dice_WT  Dice_TC  Hausdorff95_ET  Hausdorff95_WT  Hausdorff95_TC
Testing     0.859    0.916    0.862    12.62           6.17            18.22


Compared with the performance in the validation phase, the Dice scores of enhancing tumor
(ET), whole tumor (WT), and tumor core (TC) are higher in the testing phase. However,
the Hausdorff distances of WT and TC are worse.


5 Conclusion 


In this paper, we utilize a deep learning-based method, namely a 3D ResUNet, for brain tumor
segmentation. The ResUNet is composed of an encoding and a decoding part. The
online evaluation suggests a promising performance on both brain tumor segmentation
and overall survival prediction.


Abstract. Manual segmentation of glioblastoma is a challenging task for
radiologists and is essential for treatment planning. In recent years, deep convolutional
neural networks have been shown to perform exceptionally well; in particular, the
winner of the BraTS 2019 challenge used a 3D U-Net architecture in combination
with a variational autoencoder, with the Dice overlap measure as a cost function. In
this work, we propose a loss function that approximates the Hausdorff distance
metric used to evaluate segmentation performance, in the hope
that it will allow better segmentation performance on new data.


Keywords: Brain tumor - U-Net - Variational autoencoder - Hausdorff distance 


1 Introduction 


Brain and other nervous system tumors were the leading cause of cancer death among 
men younger than 40 years and women younger than 20 years in the USA in 2017 
[1]. Glioblastoma (GBM) is the most common malignant primary brain tumor making 
up 54% of all gliomas and 16% of all primary brain tumors, with an incidence rate 
of 3.19 per 100,000 persons in the USA [2]. GBM treatment is complex, consisting
of tumor resection, followed by radiation therapy and chemotherapy. Delineation
and segmentation of the tumor and its subregions is a complicated and time-consuming 
manual task essential for treatment planning. The RSNA ASNR MICCAI Brain Tumor 
Segmentation (BraTS) 2021 challenge is set up to evaluate performance of various 
methods of automatic delineation of the tumor boundaries and sub-regions based on a 
large collection of MRI scans of patients with various brain tumors [3]. 

Since the MRI signal is dependent on proton density and tissue relaxation parameters,
it is an ideal imaging modality to study brain tumors. By changing the acquisition
parameters, the signal intensity can be associated with different characteristics of the
tumor. For example, oedema surrounding the tumor has a medium to dark intensity
on T1 and T1c, and is often brighter than GM or WM in FLAIR and T2. The tumor
itself can be broken up roughly into core, enhancing, necrotic and cystic regions. The
non-enhancing core is brighter than CSF and often darker than GM or WM on T1 and
T1c, but can sometimes be brighter if the tumor has high protein, fat, cholesterol or
melanin levels. With FLAIR contrast, the non-enhancing core is often darker than GM
or WM, but not as dark as CSF. In T2, it is brighter than GM or WM due to higher
water content, but not as bright as CSF. The active part of the tumor, the enhancing core,
is very bright in T1 contrast images due to gadolinium-based contrast agents leaking
through the weakened blood brain barrier in new blood vessels feeding the tumor cells.
With fewer nutrients and less oxygen, cells die and form the solid necrotic region of the tumor
that is darker than non-enhancing tumor in T1, T1c and FLAIR images, and may have a
speckled appearance in T2. The necrotic cystic regions of a tumor are filled with liquid,
and thus have an intensity similar to CSF in T1, T1c, T2 and FLAIR images. We note
that in BraTS data, the necrotic region is not differentiated from the tumor core in the
training labels.

With their ability to represent very complex distributions, deep convolutional neural 
nets are ideal to model the intensities of the different brain tumor regions. 

Results from the previous BraTS competitions [4, 5] showed that the best performing 
methods used various forms of convolutional neural networks (CNN). In particular, the 
winner of the segmentation part of BraTS 2019 challenge used a U-net architecture 
combined with variational auto-encoder regularization [5, 6]. 

The design of our network is inspired by the one published by Myronenko et al. [6]. 
The main difference between that work and ours is in the loss function where we integrate 
a combination of both the Dice kappa overlap metric [7] and a multi-label Hausdorff-like 
distance approximation, inspired by the single label Hausdorff approximation suggested 
by Karimi et al. [8]. 


2 Methods 


Our deep learning convolutional neural network is based on a 3D version of the
U-Net architecture [9], which is frequently used for semantic segmentation of 3D medical
images. This architecture consists of two parts: an encoder, where image features at
different levels of detail are extracted, and a decoder, which combines features to produce
segmentation results. In addition to the encoder and decoder branches, our network
includes a variational auto-encoder (VAE) branch, similar to the work of Myronenko
[6], that is designed to re-create the input image, with the idea that it provides additional
regularization to the network parameters.
The overall design of the proposed network is shown in Fig. 1.

Fig. 1. DNN architecture


2.1 Loss Functions 


The common measure of the goodness of anatomical region segmentation is the Dice
kappa overlap measure; many methods use this metric directly as a loss function for
CNN training. However, in the case of multiple labels, a weighted sum of separate Dice overlap
measures is often used. In our approach we decided to use the cross-entropy loss
function instead, since it produces the smoother gradients needed for training.

The common problem with both the Dice overlap and cross-entropy loss functions is that
they don’t geometrically localize the errors in segmentation, whereas accurate tracing
of the border of the tumor is very important in planning surgery or radiotherapy. The
Hausdorff distance is another metric used to estimate the quality of segmentation
results. It is sensitive to the geometric properties of the segmentation, but
it is difficult to use as a loss function to train DNNs, since it is not differentiable in its
classic form and, in addition, is not stable in the case of noisy data [8]. Previously, a method
suitable for use in DNNs was proposed in [8]; however, it was formulated for a single-label
problem. We propose a modification of the one-sided distance-transform loss function
described in [8] to extend it to the multi-label case.


$$Loss_{DT-OS}(q, p) = \frac{1}{|\Omega|} \sum_{\Omega} \left((p - q)^2 \circ d_p^{\alpha}\right) \qquad (1)$$


Equation 1 shows the one-sided Hausdorff-like distance loss function, as introduced
in [8], where Ω denotes the volume of interest, p denotes binary labels for the ground-truth
segmentation, q denotes probability labels for the DNN output, $d_p$ is the unsigned
bi-directional distance from the border of p, and α is an adjustable parameter. Since $d_p$ does
not depend on the current estimate of the segmentation, this value is precomputed in
advance, and computation of the loss function carries similar numerical complexity as
cross-entropy or Dice kappa. In our methods we used the “Exact Euclidean distance
transform” from scipy [12].
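
The one-sided distance-transform loss can be illustrated on a tiny 2D mask. This sketch assumes the loss is the mean over all pixels of (p - q)^2 weighted by the distance to the ground-truth border raised to the power α. The brute-force `border_distance` helper is a stand-in for the exact Euclidean distance transform and is only practical for toy inputs; all names are illustrative.

```python
import math


def border_distance(mask):
    """Unsigned distance of every pixel to the nearest pixel of the other class.

    Brute-force stand-in for the sum of the Euclidean distance transforms of a
    binary 2D mask and of its complement.
    """
    coords = [(r, c) for r, row in enumerate(mask) for c, _ in enumerate(row)]
    dist = {}
    for r, c in coords:
        others = [(rr, cc) for rr, cc in coords if mask[rr][cc] != mask[r][c]]
        dist[(r, c)] = min(math.dist((r, c), o) for o in others) if others else 0.0
    return dist


def loss_dt_os(q, p, alpha=1.0):
    """One-sided distance-transform loss: mean over pixels of (p - q)^2 * d_p^alpha."""
    d_p = border_distance(p)  # depends on the ground truth only, so it can be precomputed
    total = sum((p[r][c] - q[r][c]) ** 2 * d ** alpha for (r, c), d in d_p.items())
    return total / len(d_p)


ground_truth = [[0, 0, 1, 1, 0]]
error_near_border = [[0, 1, 1, 1, 0]]  # one false positive next to the structure
error_far_away = [[1, 0, 1, 1, 0]]     # one false positive two pixels away
loss_near = loss_dt_os(error_near_border, ground_truth)
loss_far = loss_dt_os(error_far_away, ground_truth)
```

In the toy example, both predictions contain exactly one wrong pixel, but the error farther from the true border is penalized more strongly, which is the behaviour that makes the loss Hausdorff-like.
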

To extend the loss function to the multilabel segmentation problem, we propose two loss
functions, given a set of labels L (including the background label) and $u_{l,p}$, the unidirectional
distance from the border of the structure with label l outside the structure, and zero inside
the structure. It’s easy to see that in the case of two labels (background and foreground) and
β = 2, this loss is equivalent to the loss in Eq. 1.


$$Loss_{mean}(q, p, L) = \frac{1}{|L|} \sum_{l \in L} \frac{1}{|\Omega|} \sum_{\Omega} \left(q_l^{\beta} \cdot u_{l,p}^{\alpha}\right) \qquad (2)$$


The goal of the second loss function is to mimic the sparse nature of the real Hausdorff
distance more closely:


$$Loss_{max}(q, p, L) = \frac{1}{|L|} \sum_{l \in L} \max_{\Omega} \left(q_l^{\beta} \cdot u_{l,p}^{\alpha}\right) \qquad (3)$$


In our experiment we used parameters α = 1, β = 1, but it’s possible to find better
parameters using cross-validation.
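
The two multi-label losses can be sketched on a toy 2D label map. The sketch assumes that each per-pixel term is the predicted probability of label l raised to β, multiplied by the distance to the nearest ground-truth pixel of label l raised to α; the mean loss averages these terms over the volume, the max loss takes their maximum, and both are averaged over the labels. The brute-force distance computation and all names are illustrative assumptions.

```python
import math


def outside_distance(label_map, label):
    """Distance to the nearest pixel carrying `label`; zero inside the structure."""
    coords = [(r, c) for r, row in enumerate(label_map) for c, _ in enumerate(row)]
    inside = [rc for rc in coords if label_map[rc[0]][rc[1]] == label]
    # A label that is absent from the ground truth gets zero distance everywhere (assumption).
    return {rc: (min(math.dist(rc, s) for s in inside) if inside else 0.0) for rc in coords}


def multilabel_losses(q, label_map, labels, alpha=1.0, beta=1.0):
    """Return (loss_mean, loss_max) for per-label probability maps q[label][r][c]."""
    mean_terms, max_terms = [], []
    for label in labels:
        u = outside_distance(label_map, label)
        terms = [q[label][r][c] ** beta * d ** alpha for (r, c), d in u.items()]
        mean_terms.append(sum(terms) / len(terms))
        max_terms.append(max(terms))
    return sum(mean_terms) / len(labels), sum(max_terms) / len(labels)


truth = [[0, 0, 1, 1, 0]]
perfect = {0: [[1, 1, 0, 0, 1]], 1: [[0, 0, 1, 1, 0]]}
spurious = {0: [[0, 1, 0, 0, 1]], 1: [[1, 0, 1, 1, 0]]}  # false positive two pixels away
perfect_losses = multilabel_losses(perfect, truth, labels=[0, 1])
spurious_losses = multilabel_losses(spurious, truth, labels=[0, 1])
```

In the toy case, the single distant false positive contributes 0.2 to the mean loss but 1.0 to the max loss, reflecting that the max variant reacts to the single worst error rather than to the average.

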
Our total loss function was as follows:


$$Loss = w_{CE} \cdot Loss_{CE} + w_{mean} \cdot Loss_{mean} + w_{max} \cdot Loss_{max} + w_{VAE} \cdot Loss_{VAE} + w_{KL} \cdot Loss_{KL} \qquad (4)$$


where $Loss_{CE}$ is the cross-entropy loss, $Loss_{VAE}$ is the L2 norm of the variational autoencoder
reconstruction error, and $Loss_{KL}$ is the Kullback-Leibler divergence of the VAE parameter
distribution from the normal distribution with zero mean and unit standard deviation, as described
in [4].

The weights of each loss were chosen empirically, based on [4] and our internal
experiments: $w_{CE} = 1.0$, $w_{mean} = 0.1$, $w_{max} = 0.01$, $w_{VAE} = 0.1$, $w_{KL} = 0.1$.
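
The weighted combination of the loss terms can be sketched as a simple weighted sum. The weights are those stated in the text; the dictionary keys and the individual loss values are placeholders chosen for illustration.

```python
# Weights as stated in the text; the individual loss values below are placeholders.
WEIGHTS = {"ce": 1.0, "mean": 0.1, "max": 0.01, "vae": 0.1, "kl": 0.1}


def total_loss(terms, weights=WEIGHTS):
    """Weighted sum of the individual loss terms."""
    missing = set(weights) - set(terms)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * terms[name] for name in weights)


example_terms = {"ce": 0.5, "mean": 2.0, "max": 10.0, "vae": 1.0, "kl": 1.0}
example_total = total_loss(example_terms)
```

With these placeholder values the distance terms contribute 0.2 and 0.1 to the total, which shows how the small weights keep the potentially large distance-based losses on a scale comparable to the cross-entropy term.
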


2.2 Data Preprocessing 


To normalize intensity ranges for all MRI scans, we used histogram matching to calcu- 
late the intensity scaling coefficient to match the reference subject BraTS2021_00000 
intensity distribution within brain mask. 


2.3 Data Augmentation 


In order to make segmentation robust with respect to the possible perturbations seen in
MRI scans, we used two kinds of data augmentation: (i) offline geometric transformation,
where random affine transformations were applied to each dataset, and the results and
distance transformations needed for the Hausdorff-like cost function were pre-computed;
(ii) online signal augmentation, where random signal shift, amplification and voxel-level
additive noise were applied (after signal intensity Z-transformation) to the images each
time data was used for training the DNN. We generated 32 offline-augmented datasets
for each original dataset.

Table 1. Data augmentation parameters


Geometric shift                μ = 0.0 mm, σ = 2.0 mm
Rotation around X, Y, Z        μ = 0.0 deg, σ = 10 deg
Geometric scaling X, Y, Z      μ = 1.0, σ = 0.03
Intensity shift                μ = 0.0, σ = 0.1
Intensity amplification        μ = 1.0, σ = 0.1
Voxel level additive noise     μ = 0.0, σ = 0.1
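
Drawing augmentation parameters from the values in Table 1 can be sketched as follows. The sketch assumes that each parameter follows a normal distribution with the listed mean and standard deviation and that geometric parameters are drawn independently per axis; neither is stated explicitly. Applying the resulting affine transform to the image is outside the scope of this illustration.

```python
import random

# (mean, standard deviation) pairs taken from Table 1.
AUGMENTATION = {
    "shift_mm": (0.0, 2.0),
    "rotation_deg": (0.0, 10.0),
    "scale": (1.0, 0.03),
    "intensity_shift": (0.0, 0.1),
    "intensity_gain": (1.0, 0.1),
}
NOISE_SIGMA = 0.1  # voxel-level additive noise, zero mean


def sample_geometric(rng):
    """Draw one offline geometric transform: per-axis shift, rotation, and scale."""
    return {
        name: [rng.gauss(*AUGMENTATION[name]) for _axis in "XYZ"]
        for name in ("shift_mm", "rotation_deg", "scale")
    }


def augment_signal(voxels, rng):
    """Online signal augmentation of z-scored intensities: gain, shift, additive noise."""
    gain = rng.gauss(*AUGMENTATION["intensity_gain"])
    shift = rng.gauss(*AUGMENTATION["intensity_shift"])
    return [gain * v + shift + rng.gauss(0.0, NOISE_SIGMA) for v in voxels]


rng = random.Random(42)
transforms = [sample_geometric(rng) for _ in range(32)]  # 32 offline variants per dataset
noisy = augment_signal([0.0, 1.0, -1.0], rng)
```
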


2.4 Model Training 


For the final training before submission, we split the offline-augmented datasets into
two sets: training (datasets corresponding to 1241 subjects) and validation (10 unique
subjects).

To train the DNN we used AdamW, a variant of the Adam optimization algorithm [10]
with decoupled weight decay regularization [11]. We used 100 warm-up iterations with
a learning rate of 1e-7, followed by regular training with a learning rate of 1e-4 and a
weight decay (L2 regularization weight) of 1e-4. Training was done for 100 epochs.

During training we extracted random patches of 144 x 144 x 144 voxels from each 
dataset. Four available imaging modalities were concatenated as four input channels to 
the DNN. The output of the DNN was a four-channel probability map (after softmax) 
corresponding to the Background (BKG), enhancing tumor (ET), the tumor core (TC) 
and necrosis, and the whole tumor (WT). After the end of each epoch, the DNN was applied
to the online validation subset to calculate the generalized overlap kappa and symmetric
Hausdorff-distance. Models corresponding to the best performance in terms of kappa 
overlap and HD were stored to be used for the final submission. For the final result, we 
used the weights of DNN corresponding to the epoch that achieved the best generalized 
Dice overlap ratio. 


2.5 Inference 


For inference, the DNN was applied to patches of 144 x 144 x 144 voxels extracted 
from the MRI scans with a stride of 64 voxels, with 4 channels corresponding to the 
4 available imaging modalities. The resulting tissue probability maps were center-cropped 
to 128 x 128 x 128 voxels to minimize edge effects, overlapping areas were merged using 
exponential averaging, and the final segmentation was created by choosing the label with 
the highest probability. 
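
The patch layout used for inference can be illustrated along a single axis. The sketch below is an assumption-laden example rather than the authors' code: the volume size of 240 x 240 x 155 voxels is assumed, the handling of the last patch (placed flush with the far edge) and of the volume borders is not described in the text, and the merging by exponential averaging is not reproduced. All function names are hypothetical.

```python
def patch_starts(size, patch=144, stride=64):
    """Start indices of patches along one axis so that the axis is covered."""
    if size <= patch:
        return [0]  # one patch; a smaller volume would need padding
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)  # assumed: extra patch flush with the edge
    return starts


def center_crop_bounds(start, patch=144, kept=128):
    """Bounds of the central `kept` voxels of a patch beginning at `start`."""
    margin = (patch - kept) // 2
    return start + margin, start + margin + kept


for axis_size in (240, 240, 155):
    print(axis_size, patch_starts(axis_size))
```

With these assumptions an axis of 240 voxels yields patches starting at 0, 64 and 96, and each patch contributes only its central 128 voxels to the merged probability map.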


3 Results 


The DNN was implemented using PyTorch version 1.9.0 [14] and trained on the BraTS 2021 
training dataset (1251 subjects) [3, 15-18]; we did not use any additional data. We split 
the BraTS 2021 training dataset into two subsets: 1241 subjects for training and 10 for 
online validation. We evaluated the impact of several parameters: offline data 
augmentation, use of mean and max distance loss, and the regularizing effect of a 
variational autoencoder. 

DNN training was performed on two systems: (i) an Nvidia DGX-1 system consisting 
of 8x Nvidia Tesla V100 GPUs with 16 GB of RAM each and (ii) a cluster of two 
workstations with Nvidia RTX-3090 GPUs (24 GB RAM) connected via a 10 Gb Ethernet 
link. In both cases a distributed data parallel scheme was used to utilize all available 
GPUs. Batch size was adjusted based on the available RAM for each system: the DGX-1 
used a batch size of 8 x 2 samples and the RTX-3090 cluster used 3 x 2 samples. 

Training one epoch with offline data augmentation took 2.5 h on the DGX-1 because the 
number of data samples was increased by a factor of 32; without offline augmentation, 
one epoch took 7.5 min on the cluster with two RTX-3090 GPUs. 

To estimate the effect of the different loss functions and of offline data 
augmentation, we performed five experiments: (i) with offline data augmentation and 
all loss functions described above; (ii) without offline data augmentation but with all 
loss functions; (iii) without offline data augmentation and without Lossmax; (iv) without 
offline data augmentation, without Lossmax, and without Lossmean; (v) without offline 
data augmentation and without VAE regularization. 

The resulting DNN was used to segment the validation dataset, and the results were 
uploaded to the BraTS 2021 online evaluation system. Performance is shown in Fig. 2. 
Overall, the use of offline data augmentation, VAE regularization, and Lossmean seems 
to improve the performance of the DNN. 

Performance of the submitted model on the testing dataset is shown in Table 2. 


Table 2. Performance on the testing dataset 


Mean StdDev Median 25th quantile 75th quantile 
Dice_ET 0.8145 0.2151 0.8829 0.8017 0.9269 
Dice_WT 0.9060 0.1266 0.9432 0.8981 0.9646 
Dice_TC 0.8463 0.2520 0.9408 0.8783 0.9681 
Sensitivity_ET 0.8437 0.2357 0.9345 0.8542 0.9665 
Sensitivity_WT 0.8959 0.1399 0.9370 0.8757 0.9723 
Sensitivity_TC 0.8441 0.2547 0.9460 0.8703 0.9757 
Specificity_ET 0.9996 0.0004 0.9997 0.9995 0.9999 
Specificity_WT 0.9995 0.0008 0.9997 0.9994 0.9999 
Specificity_TC 0.9997 0.0006 0.9999 0.9997 1.0000 
Hausdorff95_ET 19.5828 75.9712 2.2361 1.4142 3.3166 
Hausdorff95_WT 7.3669 31.8179 2.2361 1.4142 4.8990 
Hausdorff95_TC 22.3228 80.1173 2.0000 1.0000 4.1231 


Fig. 2. Performance of the DNN trained with different settings; red numbers represent 
median values. 


4 Discussion and Conclusion 


In this paper we proposed a modification of a previously published semantic segmentation 
DNN for brain tumor segmentation. Our contributions are the use of a cost function that 
is more closely related to the clinical requirements and the use of a data augmentation 
scheme that more closely mimics potential variations in clinical data. 

Our experiments with different combinations of loss functions and data augmentation 
showed that extensive data augmentation has a similar impact on the final performance 
as any of the proposed additional loss functions, and that there is a small but 
noticeable improvement in performance when using the Lossmean function in addition 
to the cross-entropy and variational autoencoder regularization. 

Since we do not have access to the test labels, we can only suggest interpretations of 
the test results. For example, the mean Dice_ET is much smaller than the median, and 
the StdDev is high. This might be due to cases where no manual ET labels exist but the 
proposed technique finds some, or vice versa. 

The median Hausdorff metrics are very good (all <2.25 mm); however, the mean 
values are quite large. This indicates that a post-processing step that removes extra 
voxels disconnected from the main regions would be useful. 
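
One simple form such a post-processing step could take is to keep only the largest connected group of voxels for a given label. The sketch below is purely illustrative and was not part of the submitted method: it assumes 6-connectivity and represents a binary mask as a set of (x, y, z) coordinates. The function name is hypothetical.

```python
from collections import deque


def largest_component(voxels):
    """Return the largest 6-connected group from a set of (x, y, z) voxels."""
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                  (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    remaining = set(voxels)
    best = set()
    while remaining:
        seed = remaining.pop()
        component = {seed}
        queue = deque([seed])
        while queue:  # breadth-first flood fill from the seed voxel
            x, y, z = queue.popleft()
            for dx, dy, dz in neighbours:
                nxt = (x + dx, y + dy, z + dz)
                if nxt in remaining:
                    remaining.remove(nxt)
                    component.add(nxt)
                    queue.append(nxt)
        if len(component) > len(best):
            best = component
    return best


mask = {(0, 0, 0), (1, 0, 0), (1, 1, 0), (9, 9, 9)}
kept = largest_component(mask)
print(sorted(kept))  # [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
```

Removing the isolated voxel at (9, 9, 9) is the kind of change that would reduce a large Hausdorff distance without affecting the main region.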


Abstract. We propose a 3D version of the Contextual Multi-scale Multi-level 
Network (3D CMM-Net) with deeper encoder depth for automated semantic 
segmentation of different brain tumors in the BraTS2021 challenge. The proposed 
network can extract and learn deeper features for the task of multi-class 
segmentation directly from 3D MRI data. Overall, the proposed network achieved 
Dice scores of 0.7557, 0.8060, and 0.8351 for enhancing tumor, tumor core, and 
whole tumor, respectively, on the local-test dataset. 


Keywords: Brain tumor segmentation · Pyramid pooling module · U-Net · 
Glioblastoma · 3D semantic segmentation · Multimodal MRI 


1 Introduction 


The incidence rate of primary brain tumors is 11-12 per 100,000 population. Gliomas 
are the most common brain tumors, accounting for about 50% of diagnosed brain 
tumors, and 26% of them are considered to be astrocytic tumors [1]. In particular, 
glioblastoma (GBM) accounts for 50-60% of all gliomas and has the highest malignancy 
among gliomas. Therefore, it is important to segment brain tumors accurately in 
order to improve diagnosis and, in turn, the choice of appropriate treatment. 

Magnetic Resonance Imaging (MRI) plays an important role in diagnosing brain 
tumors. Since 2011, the Brain Tumor Segmentation (BraTS) challenge has led to the 
development of automated segmentation networks to segment brain tumors using 3D 
multimodal MRI data. The data provided by BraTS have different contrasts and include 
T1, T2, Fluid-Attenuated Inversion Recovery (FLAIR), and T1 Contrast-Enhanced 
(T1CE) [2-6]. Figure 1 shows examples of these four images along with the brain 
tumor mask. In the BraTS2021 challenge, data from a total of 1,251 patients were 
provided with their brain tumor masks for training, and data from 219 additional 
patients were provided without mask labels for validation. The input image size for 
all data is 240 x 240 x 155 voxels. The label mask consists of three classes: Edema 
(ED), Enhancing Tumor (ET), and Necrosis (NE), where the Tumor Core (TC) is defined 
as the union of ET and NE, and the Whole Tumor (WT) is the union of ED, ET, and NE. 
In the rightmost image of Fig. 1, the green part indicates ED, the yellow part ET, 
and the red part NE. 
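
The relationship between the three annotated classes and the three evaluated regions can be sketched as follows. The integer label values used here (1 for NE, 2 for ED, 4 for ET) are an assumption of this example and are not stated in the text; the function name is hypothetical, and a flat list of voxel labels stands in for a 3D volume.

```python
NE, ED, ET = 1, 2, 4  # assumed integer codes for the three annotated classes


def region_masks(labels):
    """Map a flat list of voxel labels to binary ET, TC and WT masks."""
    et = [int(v == ET) for v in labels]
    tc = [int(v in (ET, NE)) for v in labels]      # tumor core = ET + NE
    wt = [int(v in (ED, ET, NE)) for v in labels]  # whole tumor = ED + ET + NE
    return et, tc, wt


et, tc, wt = region_masks([0, 1, 2, 4, 2, 0])
print(et)  # [0, 0, 0, 1, 0, 0]
print(tc)  # [0, 1, 0, 1, 0, 0]
print(wt)  # [0, 1, 1, 1, 1, 0]
```

The regions are nested by construction: every ET voxel belongs to TC, and every TC voxel belongs to WT.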

In this work, we propose a 3D version of the Contextual Multi-scale Multi-level 
Network (3D CMM-Net) [7] with deeper encoder depth for automated semantic 
segmentation of different sub-regions of brain tumors in the BraTS2021 challenge. The 
proposed network includes multiple pyramid pooling modules, which can produce 
multi-scale feature maps at each level of the encoder, and can extract and learn deeper 
features for the task of multi-class segmentation directly from 3D MRI data. 


Fig. 1. Example of BraTS2021 dataset. From left: T1, T2, FLAIR, T1CE, and brain tumor mask 


2 Method 


2.1 Data Preprocessing and Augmentation 


To reduce the computational complexity during training and improve the overall 
performance, we applied several preprocessing steps to our dataset. First, we normalized 
all input images to zero mean and unit standard deviation. Then, we cropped all volumes 
with a center spatial crop from 240 x 240 x 155 to 128 x 128 x 128 voxels. This 
cropping reduces the size of the input images and hence lowers the computation cost 
during training. Note that all the cropped data still include the structure of the brain 
tissue as well as the tumors. To take advantage of the four available image modalities 
(i.e., T1, T2, FLAIR, and T1CE), we concatenated all four and used them as the input 
to our network. This could help in extracting various spatial features during training 
and enhance the overall segmentation of brain tumors. 
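
The two preprocessing steps can be sketched in a few lines. This is a minimal illustration with hypothetical function names: it computes the index ranges of a center crop and applies z-score normalization to a flat list of intensities. Whether the statistics are computed over the whole volume or over brain voxels only is not stated in the text; the whole input is assumed here.

```python
from statistics import fmean, pstdev


def center_crop_slices(shape, target=(128, 128, 128)):
    """Slices selecting the central `target` region of a volume of `shape`."""
    slices = []
    for size, want in zip(shape, target):
        start = (size - want) // 2
        slices.append(slice(start, start + want))
    return tuple(slices)


def zscore(values):
    """Normalize to zero mean and unit standard deviation."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # guard against a constant input
    return [(v - mean) / std for v in values]


print(center_crop_slices((240, 240, 155)))
# (slice(56, 184, None), slice(56, 184, None), slice(13, 141, None))
print(zscore([1.0, 2.0, 3.0, 4.0]))
```

For a 240 x 240 x 155 volume the crop keeps voxels 56 to 183 in-plane and slices 13 to 140 along the third axis.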

Moreover, we used several data augmentation techniques to enlarge our training data. 
We randomly flipped all input images with a probability of 0.4 and, with a probability 
of 0.4, rotated them in the x-y plane by an angle between 90° and 270°. Finally, we 
randomly adjusted the contrast of the input images, which is a kind of gamma correction, 
with a probability of 0.5. 
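
The augmentation steps can be illustrated on a single 2D slice stored as a list of rows. This is a sketch under several assumptions that the text does not specify: the flip is taken to be horizontal, the rotation is taken to be a multiple of 90° (90°, 180° or 270°), intensities are assumed to be scaled to [0, 1], and the gamma range of 0.5 to 1.5 is chosen arbitrarily. All function names are hypothetical.

```python
import random


def rot90(img, k=1):
    """Rotate a 2D list-of-rows image by k * 90 degrees counter-clockwise."""
    for _ in range(k % 4):
        img = [list(row) for row in zip(*img)][::-1]
    return img


def flip(img):
    """Mirror the image left to right (assumed flip axis)."""
    return [row[::-1] for row in img]


def adjust_gamma(img, gamma):
    """Gamma-style contrast change for intensities scaled to [0, 1]."""
    return [[v ** gamma for v in row] for row in img]


def augment(img, rng):
    """Apply each augmentation with the probabilities given in the text."""
    if rng.random() < 0.4:
        img = flip(img)
    if rng.random() < 0.4:
        img = rot90(img, rng.choice([1, 2, 3]))  # 90, 180 or 270 degrees
    if rng.random() < 0.5:
        img = adjust_gamma(img, rng.uniform(0.5, 1.5))  # assumed gamma range
    return img


sample = [[0.0, 0.25], [0.5, 1.0]]
print(augment(sample, random.Random(0)))
```

In a 3D pipeline the same operations would be applied to every slice of all four modalities and to the label mask, with the contrast change skipped for the mask.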


2.2 Dilated Convolution 


Dilated convolution is a method of enlarging the receptive field by inserting zeros 
between the filter weights [8]. Its advantage over conventional convolution is that it 
enlarges the receptive field while maintaining the same number of weights in the 
convolution kernel. Basic convolution and dilated convolution are defined as: 


$$ f[x] * w[x] = \sum_{k} f[x - k] \cdot w[k] \tag{1} $$
$$ f[x] *_r w[x] = \sum_{k} f[x - r \cdot k] \cdot w[k] \tag{2} $$


where f[x] and w[x] are a discrete input image and a discrete filter (kernel), 
respectively. In (1) and (2), '*' denotes convolution and '·' denotes multiplication. 
The dilation rate r in (2) is the gap between the locations of the weights in the 
convolution kernel. A larger r gives a larger receptive field with little loss of 
information in the spatial dimension. Figure 2(a) demonstrates how dilated convolution 
works when r = 2 and the kernel size is 3 x 3. Because it maintains spatial 
information, dilated convolution is widely used for segmentation. 
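
A one-dimensional example makes the effect of the dilation rate concrete. The sketch below, with a hypothetical function name, sums f[x - r*k] * w[k] over the kernel index k and evaluates the result only where all required input samples exist; it is an illustration, not the network's 3D implementation.

```python
def dilated_conv1d(f, w, r=1):
    """y[x] = sum_k f[x - r*k] * w[k], evaluated where all samples exist."""
    span = r * (len(w) - 1)  # receptive field minus one
    return [
        sum(f[x - r * k] * w[k] for k in range(len(w)))
        for x in range(span, len(f))
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 1, 1]
print(dilated_conv1d(signal, kernel, r=1))  # [6, 9, 12, 15, 18]
print(dilated_conv1d(signal, kernel, r=2))  # [9, 12, 15]
```

With the same three weights, the receptive field grows from 3 samples at r = 1 to 5 samples at r = 2, following r * (kernel size - 1) + 1.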


2.3 Pyramid Pooling Module 


The primary advantage of the Pyramid Pooling Module (PPM) is that it can obtain both 
local and global features at the same time [9]. The PPM proceeds step by step as 
follows. A pooling kernel of a different size is applied in each pyramid. As shown 
in Fig. 2(b), the spatial sizes of the feature maps after pooling are 2 x 2 x 2, 
4 x 4 x 4, 8 x 8 x 8, and 16 x 16 x 16. Next, a 1 x 1 x 1 convolution reduces the 
number of channels in the feature map of each pyramid by dividing it by the number of 
pyramids (four in this work, as in Fig. 2(b)), so the number of channels after the 
convolution is a quarter of that of the input feature maps. Then, through upsampling, 
the feature maps of each pyramid are resized to their original size just before the 
PPM. Finally, all these feature maps are concatenated with the original one, so the 
PPM output has twice as many channels as the PPM input. 

The Half Pyramid Pooling Module (HPPM), located in the bottleneck of the proposed 
network as shown in Fig. 3, differs only slightly from the PPM: the HPPM concatenates 
only the feature maps of the pyramids, without adding the original input feature 
maps. This is because the number of channels of the feature maps in the bottleneck is 
very large, and using the PPM there would exceed the GPU memory limit. 
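
The channel bookkeeping of the two modules can be checked with a short calculation. The helper below is a hypothetical illustration of the rule described above (each pyramid receives the input channels divided by the number of pyramids; the PPM additionally concatenates its input, the HPPM does not).

```python
def ppm_channels(in_channels, pyramids=4, half=False):
    """Number of output channels of a PPM (half=False) or HPPM (half=True)."""
    per_pyramid = in_channels // pyramids  # after the 1 x 1 x 1 convolution
    pooled = per_pyramid * pyramids        # concatenated pyramid outputs
    return pooled if half else in_channels + pooled


for c in (32, 96, 256):
    print("PPM ", c, "->", ppm_channels(c))
for c in (1024, 1400):
    print("HPPM", c, "->", ppm_channels(c, half=True))
```

These values agree with the output sizes listed in Table 1: the PPM doubles 32, 96 and 256 channels to 64, 192 and 512, while the HPPM leaves 1024 and 1400 channels unchanged.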


Fig. 2. (a) Description of the dilated convolution with dilated rate r = 2 of a 3 x 3 kernel, (b) The 
pyramid pooling module with four pyramids, where the pyramids have sizes of 2 x 2 x 2, 
4 x 4 x 4, 8 x 8 x 8, and 16 x 16 x 16 


2.4 Network Architecture 


We use the CMM-Net [7] as our backbone since it performs well on segmentation 
tasks in the medical domain. In this work, we develop a 3D version of the existing 
2D CMM-Net and increase the depth of the encoder, with two HPPM blocks in the 
bottleneck of the network, as shown in Fig. 3. We apply dilated convolution in all 
convolution blocks of our network to enlarge the receptive field without increasing 
the number of weights in the convolution kernel, and we use the PPM in the encoder to 
obtain multi-scale feature maps at once. The structure of our model is detailed in 
Table 1. 


2.4.1 Loss 


The proposed 3D CMM-Net is optimized by minimizing the Dice loss [9], which is 
computed as: 


$$ L_{dice} = 1 - \frac{2 \sum P_{true} \cdot P_{pred}}{\sum P_{true} + \sum P_{pred} + \epsilon} \tag{3} $$


where P_true and P_pred indicate the label mask provided by BraTS and the predicted 
mask of our model, respectively. The summations in (3) run over all voxels and ε 
prevents division by zero. Since the output of the proposed network has 3 channels 
for TC, WT, and ET, excluding the background class, we applied the Dice loss to 
each output channel. 
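
A per-channel Dice loss of this kind can be sketched on flattened masks. This is a minimal illustration with hypothetical function names, written as one minus the Dice overlap so that minimizing it increases the overlap. The value of the smoothing constant and the averaging over channels are assumptions of this example, not details taken from the text.

```python
def dice_loss(p_true, p_pred, eps=1e-5):
    """One minus the Dice overlap of two flattened masks (eps is assumed)."""
    intersection = sum(t * p for t, p in zip(p_true, p_pred))
    denominator = sum(p_true) + sum(p_pred) + eps
    return 1.0 - 2.0 * intersection / denominator


def multi_channel_dice_loss(true_channels, pred_channels):
    """Average the per-channel loss; the averaging is an assumption."""
    losses = [dice_loss(t, p) for t, p in zip(true_channels, pred_channels)]
    return sum(losses) / len(losses)


perfect = dice_loss([1, 0, 1, 0], [1, 0, 1, 0])
disjoint = dice_loss([1, 1, 0, 0], [0, 0, 1, 1])
print(perfect, disjoint)
```

A perfect prediction gives a loss close to zero and a prediction with no overlap gives a loss of one; soft predictions in [0, 1] fall in between.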


2.4.2 Optimization 


We use the Adam optimization algorithm to train the model with an initial learning rate 
α_0 = 1e-4 and decrease it gradually as: 


$$ \alpha = \alpha_0 \cdot \left(1 - \frac{e}{N_e}\right)^{0.9} \tag{4} $$


where e is the current epoch and N_e is the total number of epochs. We use 100 
epochs in our case. We implemented our network using PyTorch [10] and trained it on one 
NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. 
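
The polynomial decay of the learning rate can be written as a small function of the epoch counter. The function name is hypothetical, and the default initial rate of 1e-4 is an assumed illustrative value.

```python
def poly_lr(epoch, total_epochs=100, base_lr=1e-4, power=0.9):
    """Learning rate after `epoch` epochs: base_lr * (1 - e / N_e) ** power."""
    return base_lr * (1.0 - epoch / total_epochs) ** power


for epoch in (0, 25, 50, 75, 99):
    print(epoch, poly_lr(epoch))
```

The rate starts at the initial value and falls monotonically towards zero at the last epoch. In PyTorch, a schedule like this could be attached to an optimizer with `torch.optim.lr_scheduler.LambdaLR` by supplying the multiplicative factor `(1 - e / N_e) ** 0.9`.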


Fig. 3. Proposed 3D CMM-Net architecture 


3 Result 


To train the proposed network, we used the BraTS2021 training dataset, which contains 
1,251 patients, without additional data. Before the BraTS validation dataset was 
released, we randomly selected 51 patients from the BraTS2021 training dataset as a 
test dataset to evaluate the performance of the proposed network. We call this subset 
the 'local-test' dataset. On the local-test dataset, the proposed 3D CMM-Net obtained 
Dice scores per class of ET: 0.7557, TC: 0.8060, and WT: 0.8351, as shown in Table 2. 


Table 1. Detailed architecture of the proposed 3D CMM-Net, where BN stands for Batch 
Normalization, Conv3d-r: 3 x 3 x 3 convolution with dilated rate r, MP: Multiscale Pooling 
in PPM as shown in Fig. 2(b), Conv: 1 x 1 x 1 convolution, Up: 3D linear spatial upsampling 


Name       Contents                                        Output size 
Input      Cropped & concatenated 4ch image                4 x 128 x 128 x 128 
Conv1      Conv3d-6 - ReLU - BN - Conv3d-6 - ReLU - BN     32 x 128 x 128 x 128 
PPM1       MP - Conv - Upsample - Concat                   64 x 128 x 128 x 128 
Concat1    Conv1 + PPM1                                    96 x 128 x 128 x 128 
Pool1      Max Pooling                                     96 x 64 x 64 x 64 
Conv2      Conv3d-5 - ReLU - BN - Conv3d-5 - ReLU - BN     96 x 64 x 64 x 64 
PPM2       MP - Conv - Upsample - Concat                   192 x 64 x 64 x 64 
Concat2    Conv2 + PPM2                                    288 x 64 x 64 x 64 
Pool2      Max Pooling                                     288 x 32 x 32 x 32 
Conv3      Conv3d-4 - ReLU - BN - Conv3d-4 - ReLU - BN     256 x 32 x 32 x 32 
PPM3       MP - Conv - Upsample - Concat                   512 x 32 x 32 x 32 
Concat3    Conv3 + PPM3                                    768 x 32 x 32 x 32 
Pool3      Max Pooling                                     768 x 16 x 16 x 16 
Conv4-1    Conv3d-3 - ReLU - BN - Conv3d-3 - ReLU - BN     1024 x 16 x 16 x 16 
HPPM1      MP - Conv - Upsample                            1024 x 16 x 16 x 16 
Conv4-2    Conv3d-3 - ReLU - BN - Conv3d-3 - ReLU - BN     1400 x 16 x 16 x 16 
HPPM2      MP - Conv - Upsample                            1400 x 16 x 16 x 16 
Conv4-3    Conv3d-3 - ReLU - BN - Conv3d-3 - ReLU - BN     2048 x 16 x 16 x 16 
Up1        Upsample                                        2048 x 32 x 32 x 32 
Concat4    Concat3 + Up1                                   2816 x 32 x 32 x 32 
Conv5      Conv3d-4 - ReLU - BN - Conv3d-4 - ReLU - BN     256 x 32 x 32 x 32 
Up2        Upsample                                        256 x 64 x 64 x 64 
Concat5    Concat2 + Up2                                   544 x 64 x 64 x 64 
Conv6      Conv3d-5 - ReLU - BN - Conv3d-5 - ReLU - BN     96 x 64 x 64 x 64 
Up3        Upsample                                        96 x 128 x 128 x 128 
Concat6    Concat1 + Up3                                   192 x 128 x 128 x 128 
Conv7      Conv3d-6 - ReLU - BN - Conv3d-6 - ReLU - BN     32 x 128 x 128 x 128 
Conv-out   Conv                                            4 x 128 x 128 x 128 
Output     Predicted masks with 4 different classes        4 x 128 x 128 x 128 
           for each channel 

We performed an ablation study by changing the structure of 3D CMM-Net in 
order to determine whether the HPPM is beneficial. 


[Figure 4 panels. Columns: FLAIR, Label, 3D CMM-Net wo/ HPPM, 3D CMM-Net w/ HPPMs. Rows: patient 1619, patient 1648.] 


Fig. 4. Prediction results of two different models for the ablation study on some patients 
from the local-test dataset. The green area indicates ED, the yellow subregion ET, and the 
red one NE. (Color figure online) 


Fig. 5. Training losses under different conditions, where the basic model structure is 
3D CMM-Net with two additional encoders. 


We examined two models: 3D CMM-Net with only two additional encoder blocks, and 
3D CMM-Net with two additional encoder blocks and two HPPMs. Example results are 
shown in Fig. 4. In the first row of Fig. 4, the mask predicted by the model without 
HPPM contains a wrong ED area, indicated by the yellow arrow. With the HPPM, the 
wrongly predicted ED sub-region disappears from the predicted mask. In the second 
row of Fig. 4, the network without HPPM wrongly predicts NE, as pointed out by the 
yellow arrow. Although the output of the network with HPPM still contains an NE 
prediction error, its size is considerably smaller. 

We conducted another ablation study to find out how the PPM and HPPM affect the 
training loss, using 3D CMM-Net with two additional encoder blocks as the basic model. 
As shown in Fig. 5, all training losses decreased stably, but without the PPM (green 
line) the loss converged to a higher value than the rest. For the HPPM cases (orange 
and brown lines), a model with the PPM added to the basic model was used. Although all 
losses decreased similarly during the first 10 epochs, the training loss decreased 
faster than the others after 10 epochs when two HPPMs were used. 

After the BraTS2021 validation dataset, which does not include label masks, was 
released, we retrained our model using the entire training dataset (i.e., 1,251 
patients), including the local-test dataset. The proposed network then obtained Dice 
scores per class of ET: 0.7321, TC: 0.7514, and WT: 0.8743, as shown in Table 3. 


Table 2. Dice score per class on the local-test dataset for different models 


Network                                                     Dice 
                                                            ET      TC      WT      Avg 
3D CMM-Net                                                  0.7450  0.8049  0.8076  0.7859 
3D CMM-Net with 2 additional encoder blocks                 0.7502  0.8053  0.8077  0.7877 
3D CMM-Net with 2 additional encoder blocks and 1 HPPM      0.7556  0.8055  0.8347  0.7986 
3D CMM-Net with 2 additional encoder blocks and 2 HPPMs     0.7557  0.8060  0.8351  0.7989 


On the test dataset, our model obtained Dice scores of 0.7212, 0.7410, and 0.7702 
for ET, TC, and WT, respectively, as reported in Table 4. 


Table 3. Dice score and Hausdorff distance per class on the validation dataset for the 
proposed model 


Metric           Dice                      Hausdorff (mm) 
Class            ET      TC      WT        ET        TC       WT 
Mean             0.7321  0.7514  0.8743    35.0074   24.6376  10.1613 
SD               0.5987  0.3070  0.1823    101.6312  77.2684  36.5176 
Median           0.8521  0.9023  0.9320    2.2361    3.4641   2.8284 
25th quantile    0.7180  0.7068  0.8785    1.1414    2        1.7321 
75th quantile    0.9075  0.9430  0.9532    5.4312    10.8166  5.7549 


Table 4. Dice score and Hausdorff distance per class on the test dataset for the proposed 
model 


Metric           Dice                      Hausdorff (mm) 
Class            ET      TC      WT        ET        TC       WT 
Mean             0.7212  0.7410  0.7702    31.6602   34.8666  22.8658 
SD               0.2950  0.3174  0.2544    95.4076   97.0577  69.2853 
Median           0.8399  0.8898  0.8755    2.2361    5.4772   4 
25th quantile    0.7036  0.7047  0.7495    1.4142    3        2 
75th quantile    0.9094  0.9489  0.9240    7.0152    13.5089  11.7045 


4 Discussion 


In this work, we proposed a 3D deep learning network for semantic segmentation of brain 
tumors from 3D multimodal MRI data. There are three tumor classes to segment: ET, TC, 
and WT. During the experiments with the local-test dataset, we found that ET was the 
most difficult class to segment across all local-test cases. This is because ET 
occupies the smallest part of the total tumor area [11]. At first, we added two more 
encoder blocks to address this issue, and the Dice score of ET increased slightly from 
0.7450 to 0.7502. However, increasing the number of encoders to extract deeper features 
causes another problem. As illustrated in Fig. 4, the network with only two additional 
encoders added to the 3D CMM-Net incorrectly predicted the ED region or the NE part. 

Thus, we added an HPPM block between the two added encoders to solve this 
problem. By adding the HPPM block, the proposed network can further extend the 
receptive field and at the same time obtain multi-scale feature maps. The prediction 
output of our model with two additional encoders and HPPMs is shown in the rightmost 
column of Fig. 4. Increasing the number of encoders and adding HPPMs in the bottleneck 
of our network improved the performance. Our proposed 3D CMM-Net with deeper encoder 
and HPPM achieves Dice scores of 0.7321, 0.7514, and 0.8743 for ET, TC, and WT, 
respectively, on the validation dataset. The result on WT for the validation data was 
higher than for the local-test dataset. On the test dataset, our model obtained lower 
Dice scores than on the local-test and validation sets, as shown in Table 4. However, 
this trend is common among the winning entries from 2017 to last year. 

We also examined the training loss under different conditions and found that, for 
3D CMM-Net with two additional encoders and two HPPMs, the loss decreased faster than 
for the others after 10 epochs. This suggests that the HPPM helps the model learn 
efficiently from the given data, because the HPPM can extract local and global features 
at the same time. 

However, the number of encoder blocks that can be added to further improve the 
segmentation performance is limited by the available GPU memory. Recently, there has 
been a trend to improve performance by exploiting the Vision Transformer [12-14]. In 
the future, we plan to incorporate the Vision Transformer into our proposed network. 


Abstract. Glioblastomas are the most common and aggressive malignant 
primary tumors of the central nervous system in adults. The tumors 
are quite heterogeneous in shape, texture, and histology. Patients 
diagnosed with glioblastoma typically have low survival rates, 
and it can take weeks to perform a genetic analysis of an extracted 
tissue sample. If an effective way to diagnose glioblastomas through 
imaging and AI techniques can be found, it can improve patients' 
quality of life through better planning of the required therapy and 
surgery. This work is part of the Brain Tumor Segmentation (BraTS) 2021 
challenge. The challenge is to predict the MGMT promoter methylation 
status from multi-modal MRI data. We propose a multi-modal late fusion 
3D classification network for brain tumor classification on 3D MRI 
images that uses all 4 modalities (T1w, T1wCE, T2w, FLAIR) and can be 
extended to include radiomics features or other external features. We 
then compare it against 3D classification models trained on each image 
modality on its own and ensembled together during inference. 


Keywords: Brain tumor · Medical imaging · Multi-modal 
classification 


1 Introduction 


Glioblastomas are aggressive malignant primary tumors of the central nervous 
system in adults. Patients typically have a very poor prognosis, and the current 
gold standard for treatment consists of surgery followed by chemotherapy 
and/or radiotherapy. MGMT (O6-methylguanine-DNA methyltransferase) is a 
DNA repair enzyme, and methylation of its promoter in newly diagnosed 
glioblastoma has been identified as a favorable prognostic factor and a predictor 
of chemotherapy response. Thus, determination of the MGMT promoter methylation 
status in newly diagnosed glioblastoma can influence treatment decision making. 


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 

There is some evidence that the presence of MGMT promoter methylation is a 
strong predictor of responsiveness to chemotherapy. Techniques that can leverage 
this feature could therefore introduce new treatment and management strategies 
that give brain cancer patients less invasive treatment options. 

MRI data of different modalities (T1w, T1wCE, T2w, and FLAIR) have 
been provided by the challenge to predict the MGMT promoter methylation 
status. The intrinsic features of the biological tissue contribute to its signal 
intensity on an MR image and hence to image contrast. The proton density determines 
the maximum signal that can be obtained from a given tissue. The T1 time of a 
tissue is the time it takes for the excited spins to recover and be available for 
the next excitation. It affects signal intensity indirectly and can be changed at 
random. It can only be contrast enhanced. Images with contrast that is mainly 
determined by T1 are called T1-weighted images (T1w). The T2 time mostly 
determines how quickly an MR signal fades after excitation. The T2 contrast of 
an MR image can be controlled by the operator as well. Images with contrast 
that is mainly determined by T2 are called T2-weighted images (T2w). FLAIR 
is also considered a T2-weighted technique, but it dampens the ventricular CSF 
signal compared to normal T2w images. 

Features generated by radiomics and genomics, which together give rise to 
the term radiogenomics, are also an active area of research in model development. 
However, extracting such features from the tumor requires a dataset annotated 
with ground truth segmentation masks of the tumor location, which were not 
provided with this challenge. 

This year, the BraTS 2021 training dataset consisted of 585 cases, each with 
four different 3D MRI modalities (T1w, T1wCE, T2w, and FLAIR) that are not 
rigidly aligned to the same space. The validation dataset (81 cases) is used to 
calculate the public leaderboard ranking on Kaggle. 

In this work, we describe our multi-modality fusion approach for 3D brain 
MGMT classification from multimodal 3D MRI images. 


2 Related Work 


The BraTS challenge has been running for many years and has produced plenty 
of research on the state of the art in segmentation, uncertainty classification, 
survival prediction, and other tasks. For example, past iterations have investigated 
many different techniques in the area of segmentation [27]. A lot of great work 
has been possible due to this challenge and the datasets provided [2-5, 10]. 
Large annotated datasets are not as readily available in the medical imaging 
domain as in other domains. Therefore, augmentation techniques that generate 
more data have been shown to improve the performance of networks [13, 14]. These 
two papers provided many ideas on data augmentation to try while manipulating 
the data for the challenge. GANs [6] are a state-of-the-art technique for 
generating synthetic data to increase the amount of data for modelling. A review 
of the use of GANs in medical imaging is provided in [7], and the results have 
been promising. 

Radiomics [8] is a high-throughput feature extraction process that extracts 
mineable data from images, followed by the analysis of these data for decision 
support. The features can contain first-, second-, and higher-order statistics. 
These data are combined with other demographic data and mined with 
sophisticated bioinformatics tools to develop models that may potentially 
improve diagnostic, prognostic, and predictive accuracy. At present, the field 
of radiomics research concentrates on improving models to provide the most 
accurate possible diagnoses, which will lead to better patient care and outcomes. 
It has also been used in problems relating to brain tumors and survival 
prediction, such as in [9]. 


3 Method and Experiments 


3.1 Data Description 


The BraTS dataset [10] consists of retrospective brain tumor mpMRI scans 
acquired at multiple institutions under standard clinical conditions, although 
with different equipment and imaging protocols. The imaging quality is therefore 
heterogeneous due to the diverse clinical practice across institutions. Inclusion 
criteria for the Task 2 challenge’s dataset comprised a pathologically confirmed 
diagnosis and available MGMT promoter methylation status. The data have been 
updated since the previous iteration of the BraTS challenge, and the total number 
of cases has increased from 660 to 2,000. The MGMT methylation status was based 
on the laboratory assessment of the surgical brain tumor specimen. 

The mpMRI scans consist of 4 different modalities acquired with various 
protocols and different scanners at multiple institutions. 

Standardized pre-processing has been applied to all the BraTS mpMRI 
scans. Specifically, the applied pre-processing routines include conversion of the 
DICOM files to the NIfTI file format, re-orientation to a common orientation 
system such as RAI, co-registration to the same anatomical template, resampling 
to a uniform isotropic resolution (1 mm), and finally skull-stripping. The 
pre-processing pipeline is publicly available through the Cancer Imaging Phenomics 
Toolkit (CaPTk) [11] and Federated Tumor Segmentation (FeTS) [12]. Conversion 
to NIfTI strips the DICOM metadata from the images and essentially 
removes all Protected Health Information (PHI) from the DICOM headers. 
Furthermore, skull stripping mitigates potential facial reconstruction/recognition of 
the patient. 

For Task 2 (Radiogenomic Classification), all the imaging volumes were 
converted from NIfTI to DICOM files while preserving the original patient space. 
Each MRI sequence and its associated DICOM scan in the patient space are 
required for this conversion process. The DICOM scans were read as ITK images 
and the skull-stripped volume was rigidly registered to them, providing a 
transformation matrix that defines the spatial mapping between the two volumes. 
The acquired transformation matrix was applied to each skull-stripped volume 
and its corresponding segmentation labels to translate them both to the patient 
space. The transformed volumes were then passed through CaPTk’s NIfTI to 
DICOM conversion engine to generate DICOM image volumes for the 
skull-stripped images. Once all MRI sequences had been converted back to the DICOM 
file format, the dataset was anonymized further in two steps: the RSNA 
Clinical Trials Processor Anonymizer and whitelisting of DICOM files. 

The data provided by the competition comprise three cohorts: Training, 
Validation (Public), and Testing (Private). The training and validation cohorts 
are provided to the participants, who do not have access to the “Testing” cohort 
at any time, during or after the competition. The training dataset was sourced 
from 18 institutions internationally; some of the data come from The Cancer 
Imaging Archive (TCIA), but the majority have not previously been made publicly 
available. 

As the organizers revealed after the competition ended, the private test set 
included a significant proportion of cases from organizations not represented in 
the training dataset, in order to simulate a real-world clinical environment and 
evaluate how well the models generalize to data obtained at different sites. 


3.2 Data Pre-processing and Augmentation 


Data augmentation techniques have been shown to implicitly regularize deep
neural networks and improve their generalization to unseen datasets. This is
vital in scenarios where the amount of high-quality ground-truth data is limited
because acquiring and annotating new data is costly and time-consuming. The
experiments with BraTS datasets in [13,14] both show that data augmentation
significantly improves the performance of the neural network. Elastic deformations
and brightness adjustment seem to be the best combination of augmentations to
apply to the data. It can also be useful to train the network on brain scans that
are oriented differently so that the model does not overfit to the training data;
this is possible because all subjects in BraTS have been co-registered
to a common space.

The types of data pre-processing and augmentation that have been performed 
on the mpMRI scans so far are as shown below. 


Data Pre-processing:


— Perform resampling and alignment of the planes of different MRI imaging 
modalities (the planes are different even for the same patient between different 
modalities) to one reference patient 

— Create sub-volumes of 64 x 64 x 64 voxels and 128 x 128 x 128 voxels 

— Remove blank images 

— Crop to focus on regions of interest 

— Normalize and standardize intensity values 

— Apply CLAHE for histogram equalization 

Data Augmentations:


— Random Scaling 

— Random Rotation 

— Random Flipping 

— Random Shearing 

— Brightness adjustment 

— N4 bias field correction, which has been shown to work well in [5,28]


3.3 Single Modality Classification Networks 

Fig. 1. EfficientNet compound scaling [16]

Fig. 2. EfficientNet architecture [17]


Our current baseline classification network is an EfficientNet, which uniformly
scales network dimensions such as depth, width, and resolution. A grid search is
performed to find the relationship between the different scaling dimensions of the
baseline network under a fixed resource constraint. This model scaling is the
main idea of the network and can be seen in Fig. 1. The method balances
network width, depth, and resolution by scaling each of them with a constant
ratio. The intuition behind this approach is that when the input image is
bigger, the network needs more layers and more channels to capture
fine-grained patterns on the larger image.
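The compound scaling idea can be illustrated with a few lines of arithmetic. The following is a minimal sketch, assuming the formulation and constants reported in the EfficientNet paper [16] (depth, width, and resolution multipliers of the form alpha^phi, beta^phi, gamma^phi, with alpha = 1.2, beta = 1.1, gamma = 1.15); these values are quoted from that paper and were not re-derived in this work, and the function names are purely illustrative.

```python
# Compound scaling sketch: one coefficient phi scales depth, width and resolution.
# ALPHA, BETA, GAMMA are the grid-searched constants reported for EfficientNet [16];
# they are an assumption here, not values tuned in this work.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_multiplier(phi):
    """For 2D convolutions, FLOPs grow roughly with depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scale(phi)
    return depth * width ** 2 * resolution ** 2


if __name__ == "__main__":
    for phi in range(4):
        d, w, r = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPs x{flops_multiplier(phi):.2f}")
```

With these constants, each unit increase of phi roughly doubles the computational cost, which is the fixed resource constraint the grid search operates under.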

A multi-objective neural architecture search (NAS) approach was used to
develop the architecture of the network, balancing the tradeoff between
accuracy and floating point operations. There are 8 different EfficientNets, ranging
from the baseline model B0 up to B7. We experimented with all the
B0-B7 architectures and found that a simpler model tends to perform better,
so the results presented in this paper are based on the B0 architecture. An
example of the underlying architecture of the baseline model is shown in Fig. 2.
Its building block, the MBConv block, consists of inverted residual blocks,
a squeeze-and-excitation block, and swish activation.

Our model is trained from scratch using a 3D version of EfficientNet
implemented in PyTorch. More details can be found in the original paper [16], and
the code for the 3D version is from [18] (Figs. 3 and 4).

Our initial approach trains one classification network per image modality,
uses each of them to predict the MGMT value, and then ensembles their
predictions to obtain the final predictions, as can be seen in Fig. 3.
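The single-modality ensemble and its AUC evaluation can be sketched as follows. This is a minimal illustration, assuming that the ensemble is a plain average of the per-modality probabilities (the text does not specify the ensembling rule) and using made-up probabilities and modality names; the AUC is computed directly as the fraction of positive/negative pairs that are ranked correctly.

```python
def ensemble_mean(per_modality_probs):
    """Average per-case probabilities across modality-specific models."""
    n_models = len(per_modality_probs)
    return [sum(case) / n_models for case in zip(*per_modality_probs)]


def roc_auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical probabilities from four modality-specific models for five cases.
    probs = {
        "FLAIR": [0.9, 0.2, 0.6, 0.4, 0.7],
        "T1w":   [0.8, 0.3, 0.4, 0.5, 0.6],
        "T1wCE": [0.7, 0.1, 0.5, 0.6, 0.9],
        "T2w":   [0.6, 0.4, 0.5, 0.3, 0.8],
    }
    labels = [1, 0, 1, 0, 1]  # hypothetical MGMT status per case
    fused = ensemble_mean(list(probs.values()))
    print("ensemble AUC:", roc_auc(labels, fused))
```

The pairwise AUC is quadratic in the number of cases, which is acceptable for a dataset of this size; a rank-based implementation would be preferable for larger cohorts.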


3.4 Multi Modality Classification Network 


Our next approach tries to take advantage of all the different MRI image
modalities during the training process by concatenating their feature maps before
the classification head. We also explored concatenating other features that
are not derived from the images, such as DICOM metadata, into the feature
map before classification, but this did not meaningfully improve the
model. Late fusion may lose information on the interactions between
modalities, but it is easy to train and is flexible enough to be extended and to
make predictions if one or more of the other modalities are not available.

Fig. 3. Single modality and ensemble

Fig. 4. Multi-modality approach with late fusion v1

Combining early fusion (merging the 4 different MRI modalities into a single
4-channel image) with late fusion could also be promising. There are many different
ways to perform multi-modal fusion that can be explored; they are covered in [22].

The next stage in our pipeline that we are planning to work on is to use
DeepBrainSeg to predict segmentations from each of the different modalities in the
data and then extract radiomic features from the volume of interest
(morphological, texture, histogram-based, first/second-order statistics, and others).
This is shown in Fig. 5.

Fig. 5. Multi-modality approach with late fusion v2


4 Results


Our network is implemented mainly in PyTorch [19], while image processing
and augmentation are done using SimpleITK [20,21]. Only the dataset provided
by the competition is used, without any external data. The model is trained on
Google Colab Pro, which provides a range of GPUs such as Nvidia K80s, T4s, P4s and
P100s. We compared the performance of the single modality approach (an ensemble
of single modality models), the multi-modality approach, and an ensemble of the two
approaches. The single modality approach takes about 8 h to train on Google
Colab Pro GPUs for about 25 epochs. The multi-modality approach takes about
24 h to train on Google Colab Pro for about 8 epochs.

We report our results on the public leaderboard (validation) dataset at the
time of submission of this paper. Our predictions are submitted on the Kaggle
platform alongside a notebook with inference code. We also ensembled a
combination of models trained using the single modality approach and the
multi-modality approach to test whether this would improve performance.
Surprisingly, it did not, so we believe that more data augmentation might be
needed for model robustness. Our best performing classification model gives an
AUC score of 0.698 on the public leaderboard (Table 1).


Table 1. Public leaderboard results for methods


Method name | AUC score
Ensemble of single modality | 0.634
Multi modal late fusion | 0.698
Ensemble of the two methods | 0.603


Not all of our models were scored on the private testing dataset due to a
submission scoring error on Kaggle, for which no further information was provided.
Of the models that were scored, the simpler models scored much better on
the testing dataset, although their AUC was only around 0.5-0.51, which is much
lower than on the validation dataset. Other participants have also reported much
lower AUC on the testing dataset compared to the validation dataset.


5 Discussion and Conclusion 


In this work, we described an initial multi-modal late fusion architecture for
predicting the MGMT value using all four modalities of 3D MRIs in the dataset
provided by the BraTS Challenge 2021.

We have experimented with different approaches, such as training different
state-of-the-art classification architectures in 2D and 3D, including ResNets and
SE-ResNets. We also tested different hyperparameters such as the learning rate (with
and without scheduling) and the batch size, but had to keep the batch size at 4
due to GPU memory limits. The models are trained on datasets with two different
sub-volume sizes, 64 x 64 x 64 and 128 x 128 x 128 voxels. The
current multi-modality model is trained only on the axial plane, but we plan
to also train the model on the sagittal and coronal planes.

Different data pre-processing and augmentation techniques were employed,
such as normalizing and standardizing the intensity values in the images,
removing blank images, resampling and aligning the image planes of different
modalities, cropping, N4 bias field correction, and others. The N4 bias correction
seems to be beneficial on some MRI images but not all of them, so further
investigation is needed to identify the images that will benefit the most
from this processing.

All of the model training has been done on Kaggle Notebooks
and, mainly, Google Colab Pro, which have limited VM runtimes of 9 h and 24 h,
respectively. The GPUs provided by Google Colab Pro vary depending on
availability and are outside of the user's control, so it is hard to
get a consistent runtime. There is also a quota for GPU usage, and no GPUs are
allocated once that limit is reached. As a result, the current approach has not
been fine-tuned extensively yet. We plan to perform more in-depth fine-tuning
of the final models and approach using Google AI Platform notebooks, which offer
more powerful GPUs without runtime limits or GPU quotas.

There is a lot more room to improve and extend the current architecture
over the remaining duration of the competition, so that it takes
advantage of the information available in the MRI datasets provided by the
competition as well as external datasets. One of the key ideas that we would like to
explore is to either segment the MRI brain images provided in the challenge by
hand or to use a model pretrained on brain tumor data to automatically segment
the images so that we are able to perform feature extraction using radiomics or
deep learning. If we can extract the radiomics or deep learning features for each
modality, then we can perform feature reduction by keeping statistically
significant and uncorrelated features, and then possibly fuse/concatenate them with
the combined feature vector of the different MRI modalities before the
classification head. This could improve the performance of the multi-modality
approach. An example of a feature extraction pipeline can be seen in Fig. 6.

Early fusion of the 4 different image modalities into a 4-channel image,
followed by training a classification network on this new representation, is another
possible avenue for exploration. Due to the small size of the training, public
leaderboard, and private leaderboard data, a more thorough exploration of data
augmentation techniques will probably be useful to make the models more robust.

The low generalization ability of our models was also experienced by other
participants in the competition and was covered by the organizers of the
competition in [26]. This can be partly attributed to the small size of the dataset
(training, validation, testing) as well as the presence of multi-institutional data
in the testing dataset that is not present in the training dataset. Therefore,
a simpler model and a greater focus on data processing were shown to be more
promising, as can be seen in the approach shared by the first place
winner in [25], where ensembling and complex models performed well on the
validation dataset but not on the testing dataset. The inherent difficulty of
generalizing a model to unseen data was illustrated in [27], which
showed a great reduction in performance when a model trained and validated
using public data from the US to predict a different mutation in brain cancer
(ATRX) was tested on a dataset from China.

Fig. 6. Example of a feature extraction pipeline using radiomics or deep learning.

The conclusion from participating in this challenge is that more work still
needs to be done before imaging AI can be applied confidently to
radiogenomics. There is not yet strong enough evidence that medical imaging
alone can be used to predict methylation (MGMT promoter status) or genomic
features of cancer with high enough confidence to deliver valuable prognostic
information to clinicians and patients. Additional analysis also needs to be
conducted on the discrepancy between the performance observed in the challenge and
the literature that also looked at the prediction of MGMT status, such as [28], and
on the factors that lead to this discrepancy.


Abstract. We present a joint graph convolution - image convolution neural
network as our submission to the Brain Tumor Segmentation (BraTS) 2021
challenge. We model each brain as a graph composed of distinct image regions, which
is initially segmented by a graph neural network (GNN). Subsequently, the
tumorous volume identified by the GNN is further refined by a simple (voxel)
convolutional neural network (CNN), which produces the final segmentation. This
approach captures both global brain feature interactions via the graphical
representation and local image details through the use of convolutional filters. We
find that the GNN component by itself can effectively identify and segment the brain
tumors. The addition of the CNN further improves the median performance of
the model on the validation set by 2% across all metrics evaluated.


Keywords: Graph neural networks · Brain tumor segmentation · Deep learning


1 Introduction 


Tumor segmentation is a cornerstone of nearly all standard tumor treatments. It is inte- 
gral for surgical and radiation planning, treatment response analysis, and longitudinal 
tumor monitoring, among other standard practices. However, manual tumor segmenta- 
tion is notoriously time-consuming and subjective, even for highly trained radiologists. 
Automatic tumor segmentation can produce such segmentations in a fraction of the 
time in a standardized, reproducible fashion. Over the past decade, the performance of 
automated biomedical segmentation methods has significantly improved across multi- 
ple tumor types, and brain tumors are no exception [7,9]. The Brain Tumor Segmen- 
tation dataset (BraTS) is the largest publicly available dataset of glioma MRIs and 
corresponding expert segmentations and has played a pivotal role in developing and 
evaluating these methods [3-6, 12]. 

The 2021 BraTS tumor segmentation challenge consists of over 2000 multi-para- 
metric magnetic resonance images (MRIs) of tumorous brain volumes (specifically, 
gliomas) imaged across a wide array of institutions. While the images are compiled 
from a number of different institutions, they are all processed using a standard pipeline, 
and the same four modalities are available for every volume. These are T1-weighted,
T1-weighted contrast-enhanced, T2-weighted, and Fluid Attenuated Inversion Recov- 
ery (FLAIR) modalities, all of which provide complementary information on the loca- 
tion and shape of the tumor and its compartments. The ground truth labels are generated 
using an ensemble of top-performing models from previous years and are manually 
revised by an expert neuroradiologist for all images. The challenge aims to correctly 
classify each voxel of a given brain volume as either healthy tissue, edema, enhancing 
tumor (ET), or necrotic tumor core. These tumor sub-regions can be combined into the
whole tumor (WT) and tumor core (necrotic core + enhancing tumor, TC) to further
evaluate model performance on gross tumor segmentation [2].

Our submission to the BraTS 2021 challenge is a joint graph neural network (GNN) 
- convolutional neural network (CNN) model (summarized in Fig. 1). The GNN mod- 
ule aims to partition the brain into distinct regions and predict the label of each region, 
and the CNN component refines the predictions made by the GNN. Unlike the vast 
majority of BraTS competitors in recent years [6], which exclusively perform inference 
directly on voxel data, our model instead learns and predicts primarily on a graphical 
representation of the brain. We model each brain volume as composed of small, con- 
tiguous regions and connect nearby regions using edges, forming a graph. Each graph 
node contains information summarizing the intensity information of the brain in that 
region across all four modalities, and the edges allow neighboring regions to share their 
information with each other. This formulation greatly simplifies the representation of a 
brain from millions of voxels down to only thousands of nodes, while preserving nearly 
all the information. It also enables the modeling of explicit connectivity between differ- 
ent regions of the brain and potential long-range interactions between distant regions, 
which are difficult to capture using only CNNs. We have previously developed a similar 
model composed only of a graph neural network on the 2019 BraTS dataset [13]. Here, 
we improve on our previous work by adding a shallow CNN to the end of the model, 
which smooths out the model predictions at region boundaries and provides a substan- 
tial (>2%) improvement in both median Dice score and median Hausdorff distance on 
the validation set. 

Fig. 1. GNN-CNN Model Overview. MRI modalities are first stacked to create one 3D image
with 4 channels. 1) Combined modalities are clustered into supervoxels using SLIC.
2) Supervoxels are converted to a graph structure such that each supervoxel becomes one
graph node (depicted graph is greatly simplified). 3) Graph is fed through a Graph Neural
Network. 4) Node prediction outputs (more specifically, logits) are overlaid back onto the
supervoxels. The original input image features are concatenated with re-projected node
logits. 5) The result is fed through a 2-layer CNN which produces final predictions.


2 Methods


Our GNN-CNN model is composed of two components. The core component is a graph 
neural network (GNN) [10, 14]. For a given input graph representing one patient sam- 
ple, where each node corresponds to a collection of adjacent voxels in the original MRI 
image, the GNN predicts each node’s label. Since the GNN can only predict the label of 
nodes (i.e. brain regions) atomically, its predictions are necessarily coarser than voxel- 
based predictions. This property can lead to incorrect predictions at the edges of tumor 
compartments, where created regions can contain voxels of multiple labels [13]. This 
shortcoming is especially pronounced in small tumors. Accordingly, we have added a 
second component to our model: a shallow CNN [11]. The convolutional layers receive 
both the GNN prediction logits (projected back into an image) and the original voxel 
image data. They are thus able to make fine-grained adjustments to the coarse predic- 
tions based on local voxel information. The details of the model are presented in Fig. 2. 


2.1 Graph Construction from MRI Modalities 


Both the input and the output of the GNN are required to be graph-structured data. 
Therefore, before feeding the MRI scans into our network, we transform them into 
graphs. Graphs are composed of nodes and edges, where both the nodes and the edges 
can have features associated with them. In this work, each node corresponds to one 
image region, and an edge between two nodes corresponds to spatial proximity of the 
corresponding regions. We partition the brain into regions using supervoxels. Supervox- 
els are the 3D analog to superpixels, i.e., collections of nearby pixels that share similar 
intensities. 

We construct the supervoxels using the Simple Linear Iterative Clustering (SLIC) 
algorithm [1]. SLIC uses a combination of spatial and intensity information to partition 
an image into approximately a desired number of supervoxels using K-means cluster- 
ing. While the input to SLIC is traditionally in either RGB or Lab color space, we 
find that running SLIC directly on the stacked MRI modalities still produces mean- 
ingful supervoxels. To determine the optimal hyperparameters for the SLIC algorithm, 
we perform a grid search across k, the number of supervoxels and m, the compactness 
coefficient (the weighting between spatial and intensity information), and compute the 
achievable segmentation accuracy (ASA). ASA measures how well the GNN would 
perform on a given supervoxel partitioning, given that it classifies every supervoxel 
according to the most common label of the constituent voxels. The ASA is high if there 
is a strong correspondence between supervoxel shape and tumor boundaries, resulting 
in supervoxels composed of voxels with the same label. It is low if supervoxels are 
composed of voxels with mixed labels. 
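The ASA described above can be computed directly from a supervoxel assignment and the ground-truth labels. The sketch below is a minimal illustration of that description, assuming both are given as flat per-voxel lists; the function name is hypothetical and this is not the authors' implementation.

```python
from collections import Counter, defaultdict


def achievable_segmentation_accuracy(supervoxel_ids, labels):
    """Fraction of voxels labeled correctly if every supervoxel is assigned
    the most common ground-truth label of its constituent voxels."""
    if len(supervoxel_ids) != len(labels):
        raise ValueError("inputs must have one entry per voxel")
    if not labels:
        raise ValueError("at least one voxel is required")
    counts = defaultdict(Counter)
    for sv, label in zip(supervoxel_ids, labels):
        counts[sv][label] += 1
    # For each supervoxel, only the voxels carrying the majority label are "achievable".
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(labels)


if __name__ == "__main__":
    # Three supervoxels over eight voxels; supervoxel 0 straddles a label boundary.
    supervoxel_ids = [0, 0, 0, 1, 1, 1, 2, 2]
    labels = [0, 0, 1, 2, 2, 2, 1, 1]
    print(achievable_segmentation_accuracy(supervoxel_ids, labels))
```

In a grid search over k and m, this value would be computed for each candidate SLIC partitioning and the combination with the highest ASA retained.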

After the supervoxels are generated via SLIC, we discard those supervoxels that 
lie outside the brain volume. Of the remaining supervoxels, each is assigned a feature 
vector, a label, and a set of neighbors. The feature vector summarizes the intensities of 
the input MRIs for its constituent voxels. We empirically found that intensity quintiles
for each modality yielded the best results. The label is the majority label (mode) of its
constituent voxels. The neighbors of a supervoxel are all other supervoxels which are
directly adjacent to it. A graph is then constructed where each supervoxel forms one
node with its associated features and label, and each supervoxel shares an unweighted
and undirected edge with each of its neighbors.
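The graph construction step can be sketched as follows. This is a simplified illustration rather than the authors' code: it assumes the supervoxel map is a nested list indexed as `sv_map[z][y][x]`, that voxels outside the brain are marked with -1, and that "directly adjacent" means face-to-face contact (6-connectivity). Feature vectors are left out to keep the example short.

```python
from collections import Counter, defaultdict


def build_supervoxel_graph(sv_map, label_map):
    """Return (node_labels, edges) for a 3D supervoxel map given as nested lists.

    sv_map[z][y][x] is a supervoxel id, or -1 for voxels outside the brain.
    Two supervoxels share an undirected, unweighted edge when any of their
    voxels touch face-to-face (6-connectivity, an assumption of this sketch).
    """
    label_counts = defaultdict(Counter)
    edges = set()
    depth, height, width = len(sv_map), len(sv_map[0]), len(sv_map[0][0])
    for z in range(depth):
        for y in range(height):
            for x in range(width):
                sv = sv_map[z][y][x]
                if sv < 0:
                    continue
                label_counts[sv][label_map[z][y][x]] += 1
                # Look only "forward" along each axis so each voxel pair is visited once.
                for dz, dy, dx in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    nz, ny, nx = z + dz, y + dy, x + dx
                    if nz < depth and ny < height and nx < width:
                        other = sv_map[nz][ny][nx]
                        if other >= 0 and other != sv:
                            edges.add((min(sv, other), max(sv, other)))
    # Node label = mode of the constituent voxel labels.
    node_labels = {sv: c.most_common(1)[0][0] for sv, c in label_counts.items()}
    return node_labels, sorted(edges)


if __name__ == "__main__":
    sv_map = [[[0, 0, 1], [2, 2, 1]]]      # one slice, 2 x 3 voxels
    label_map = [[[0, 0, 1], [2, 2, 1]]]
    print(build_supervoxel_graph(sv_map, label_map))
```

Storing each edge as a sorted pair makes the edge set undirected by construction, so the same pair is never recorded twice.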


2.2 GNN Architecture 


Our graph neural network is composed of several sequential GraphSAGE-pool
layers [8] alternated with the ReLU non-linearity (Fig. 2). Each layer transforms the
features of each node by aggregating information from that node's neighbors, according to
Eq. 1:

$$h_u^{(l+1)} = \sigma\left(W^{(l)} \cdot \left(h_u^{(l)} \,\|\, \max\left(\left\{\sigma\left(W_{pool} \cdot h_v^{(l)}\right),\ \forall v \in \mathcal{N}(u)\right\}\right)\right)\right) \tag{1}$$

where $h_u^{(l)}$ is the feature vector of node $u$ at layer $l$, $\sigma$ is a differentiable, non-linear
activation function, $W^{(l)}$ is a layer-specific trainable weight matrix, $W_{pool}$ is a global
trainable weight matrix, $\|$ is the concatenation operator, and $\mathcal{N}(u)$ is the subset of nodes
which are directly connected to $u$ via edges, also known as the neighborhood of $u$.
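To make the update in Eq. 1 concrete, the following is a minimal pure-Python sketch of a single pooling-aggregator layer. It assumes ReLU as the activation, omits bias terms and neighbor sampling, and uses plain nested lists for the weight matrices; the names `w_pool` and `w_layer` are illustrative stand-ins for the pooling and layer weight matrices, and this is not the library implementation used in the paper.

```python
def relu(vec):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in vec]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sage_pool_layer(features, neighbors, w_pool, w_layer):
    """One GraphSAGE-pool update following Eq. 1 (biases omitted).

    features:  dict node -> feature vector h_u
    neighbors: dict node -> list of adjacent nodes
    w_pool:    matrix applied to each neighbor before the element-wise max
    w_layer:   matrix applied to the concatenation [h_u || pooled]
    """
    pool_dim = len(w_pool)
    updated = {}
    for u, h_u in features.items():
        transformed = [relu(matvec(w_pool, features[v])) for v in neighbors[u]]
        if transformed:
            # Element-wise maximum over all transformed neighbor vectors.
            pooled = [max(column) for column in zip(*transformed)]
        else:
            pooled = [0.0] * pool_dim  # isolated node: nothing to aggregate
        updated[u] = relu(matvec(w_layer, h_u + pooled))
    return updated


if __name__ == "__main__":
    features = {0: [1.0, 2.0], 1: [3.0, -1.0], 2: [0.0, 4.0]}
    neighbors = {0: [1, 2], 1: [0], 2: [0]}
    w_pool = [[1.0, 0.0], [0.0, 1.0]]                        # identity, for readability
    w_layer = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]   # adds h_u and pooled
    print(sage_pool_layer(features, neighbors, w_pool, w_layer))
```

Stacking several such layers lets information propagate over multiple hops, so a node's prediction can depend on regions that are not directly adjacent to it.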

The input layer expects 20 features (5 quintiles for each of four modalities) and 
the output layer outputs 4 logits (one for each label). The output logits are duplicated, 
where one copy is passed directly through a loss function which backpropagates only 
through the GNN, and the other is passed through to the CNN (Fig. 2).

Fig. 2. Detailed view of GNN and CNN. Left: The GNN is composed of GraphSAGE layers
alternated with a nonlinearity. Each GraphSAGE layer updates each node's features by sampling
neighboring nodes and aggregating the features (Eq. 1). Right: 1) The output of the GNN is
reprojected into a 3D image by assigning each voxel the output logits of its corresponding node.
2) Based on this reprojection, the approximate location of the tumor predicted by the GNN is
located and cropped out. 3) The projected and cropped logits are concatenated with the image
features for that same location. This volume is then fed through a two-layer CNN. Note that the
outputs of both the GNN and CNN components have an associated loss function.


2.3 CNN Architecture


The CNN consists of two convolutional layers with a 5 x 5 x 5 kernel size and a stride of 
1 (Fig. 2). The first layer has 16 filters and the second 4 (one for each label) with ReLU 
nonlinearity between the two layers. The architecture is purposefully kept simple since 
it only serves to refine the predictions made by the GNN. 

The input to the CNN is the concatenation of the GNN output logits (f = 4) and 
the input MRI modalities (f = 4) for each voxel. Therefore, the CNN receives the 
predictions of the GNN in addition to the image features, which allows it to correct the 
predictions made by the GNN. This correction is especially relevant around the edges of 
the tumor and its compartments, where the coarse predictions from the GNN can often 
result in misclassifications of strips of voxels. We feed only the tumorous tissue through 
the CNN to reduce the memory requirement and computation time. Specifically, we 
crop out a patch of the volume containing the tumor, as predicted by the GNN, and the 
CNN further refines only that patch. 
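The reprojection and cropping that precede the CNN can be sketched as below. This is an illustrative simplification, assuming a nested-list supervoxel map with -1 for voxels outside the brain, class index 0 for healthy tissue, and an optional `margin` parameter around the predicted tumor; none of these conventions is stated in the text.

```python
def reproject_and_crop(sv_map, node_logits, background=0, margin=0):
    """Project node logits onto voxels and find the bounding box of predicted tumor.

    sv_map[z][y][x] is a supervoxel id (-1 outside the brain); node_logits maps a
    supervoxel id to its list of class logits. Returns (logit_volume, bbox), where
    bbox is ((z0, z1), (y0, y1), (x0, x1)) with exclusive upper bounds, or None
    when no voxel is predicted as tumor.
    """
    depth, height, width = len(sv_map), len(sv_map[0]), len(sv_map[0][0])
    n_classes = len(next(iter(node_logits.values())))
    outside = [0.0] * n_classes
    logit_volume, tumor_coords = [], []
    for z in range(depth):
        plane = []
        for y in range(height):
            row = []
            for x in range(width):
                sv = sv_map[z][y][x]
                logits = node_logits[sv] if sv >= 0 else outside
                row.append(logits)
                predicted = max(range(n_classes), key=logits.__getitem__)
                if sv >= 0 and predicted != background:
                    tumor_coords.append((z, y, x))
            plane.append(row)
        logit_volume.append(plane)
    if not tumor_coords:
        return logit_volume, None
    bbox = []
    for axis, size in enumerate((depth, height, width)):
        values = [coord[axis] for coord in tumor_coords]
        # Expand by the margin, clipped to the volume extent.
        bbox.append((max(min(values) - margin, 0), min(max(values) + 1 + margin, size)))
    return logit_volume, tuple(bbox)


if __name__ == "__main__":
    sv_map = [[[0, 0, 1, 2], [0, 1, 1, 2]]]
    node_logits = {0: [2.0, 0.1, 0.0, 0.0], 1: [0.0, 3.0, 0.0, 0.0], 2: [1.0, 0.0, 0.0, 0.5]}
    _, bbox = reproject_and_crop(sv_map, node_logits)
    print(bbox)
```

The cropped logits would then be concatenated channel-wise with the four MRI modalities over the same bounding box, giving the 8-channel input described above.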


2.4 Loss Functions 


We calculate and backpropagate loss through our model at two locations. A voxel-wise 
cross-entropy loss is calculated from the output of the CNN and backpropagated only 
through the convolutional layers. This loss is unweighted as the input to the CNN has 
been cropped to the tumor-containing volume. 

A node-wise weighted cross-entropy loss is calculated from the GNN logits and 
backpropagated through the GNN. The ground truth label for each node is generated by 
finding the mode of the labels in the corresponding supervoxel. This loss is weighted 
approximately inversely to the prevalence of each label to address the class imbalance. 
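A weighted node-wise cross-entropy of this kind can be sketched as follows. The exact weights used in the paper are tuned hyperparameters and are not given, so this example assumes weights set exactly inversely proportional to class prevalence and normalized to average 1; the loss is averaged over the summed weights of the nodes.

```python
import math
from collections import Counter


def inverse_frequency_weights(node_labels, n_classes):
    """Class weights proportional to 1 / prevalence, normalized to average 1.

    Classes that never occur receive weight 0 (an arbitrary choice of this sketch).
    """
    counts = Counter(node_labels)
    raw = [1.0 / counts[c] if counts[c] else 0.0 for c in range(n_classes)]
    scale = n_classes / sum(raw)
    return [w * scale for w in raw]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of the per-node cross-entropy, computed from raw logits."""
    total, weight_sum = 0.0, 0.0
    for node_logits, label in zip(logits, labels):
        peak = max(node_logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in node_logits))
        total += weights[label] * (log_norm - node_logits[label])
        weight_sum += weights[label]
    return total / weight_sum


if __name__ == "__main__":
    # Hypothetical node labels: healthy tissue (0) dominates the graph.
    labels = [0, 0, 0, 0, 0, 0, 1, 2, 2, 3]
    weights = inverse_frequency_weights(labels, n_classes=4)
    logits = [[1.0, 0.0, 0.0, 0.0] for _ in labels]  # a model that always predicts "healthy"
    print(weights, weighted_cross_entropy(logits, labels, weights))
```

With such weights, a model that labels every node as healthy is penalized heavily on the rare tumor classes instead of being rewarded for matching the majority class.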

We include this GNN loss function to obtain prediction logits of the nodes that can 
then be easily projected in the image space. It is crucial for the model’s performance that 
the GNN output be interpretable as predictions, so that the predicted tumorous volume 
can be located and cropped out. Furthermore, this formulation allows us to visualize 
the finer corrections that the CNN layer performs over the coarse GNN predictions (see 
Fig. 3 for example). 


2.5 Model Training 


In practice, we train the GNN and CNN sequentially rather than simultaneously to 
decrease training time. The GNN is trained for 300 epochs on mini-batches of 6 graphs, 
whereas the CNN is trained for 100 epochs using only one sample at a time. The training 
of a full model takes approximately 2 days on an 8 GB GPU. 

We used the AdamW optimizer with a weight decay of 0.0001 and decreased the
learning rate exponentially according to Eq. 2:

$$lr_e = lr_0 \cdot \lambda^{e} \tag{2}$$

where $lr_0$ is the initial learning rate, $e$ is the current epoch, and $\lambda = 0.98$. We found
that adding additional regularization, such as dropout or higher weight decay, did not
improve performance.
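As a small worked example of the exponential schedule in Eq. 2, the sketch below evaluates the learning rate at a few epochs. It assumes that the GNN learning rate of 0.0005 reported in Sect. 3.1 is the initial rate, which the text does not state explicitly.

```python
def exponential_lr(initial_lr, epoch, decay=0.98):
    """Learning rate at a given epoch under Eq. 2: lr_e = lr_0 * decay ** epoch."""
    return initial_lr * decay ** epoch


if __name__ == "__main__":
    # 0.0005 is assumed to be the initial GNN learning rate; the GNN trains for 300 epochs.
    for epoch in (0, 1, 100, 299):
        print(f"epoch {epoch:3d}: lr = {exponential_lr(0.0005, epoch):.3e}")
```

Because the decay is multiplicative, the learning rate drops by roughly a factor of 7.5 every 100 epochs (0.98^100 is about 0.13).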

The BraTS 2021 dataset is split into training (n=1251), validation (n=216), and
test (n = 570) partitions. The hyperparameters for only the GNN component, i.e., GNN 
layer sizes, GNN depth, learning rate, and class weighting, were tuned using random 
search and 5-fold cross-validation on the entire training set (n= 1251). The GNN archi- 
tecture with the best average performance across the 5 folds was then integrated into 
the full hybrid model. Three architectural replicates were trained on the entire dataset 
and evaluated on the validation set. The best performing replicate was then submitted 
for evaluation on the test dataset. We report the mean and median results of the best 
performing replicate on both the validation and test sets in Sect. 3.3. 


2.6 Data Preprocessing 


The BraTS dataset MRIs are all padded to a standard shape to facilitate image-based 
processing. Since our approach is primarily graph-based and does not rely on uniform 
input sizes, we first crop each patient sample to the tightest possible bounding box 
around the brain to minimize the amount of background volume prior to supervoxel 
creation. Subsequently, we rescale each MRI to the approximate (0, 1] range by dividing 
by the 99.5 percentile of intensity values in that MRI. The raw MRI data is not collected 
in a bounded range and can vary by several orders of magnitude even between two 
images of the same modality. As such, this step normalizes the intensity values to be 
consistent across the dataset. Finally, we compute the mean and standard deviation for 
each modality across the entire training dataset (on non-zero voxels) and standardize 
each modality to have zero mean and unit variance. 
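The intensity normalization steps can be sketched as follows. This is a simplified illustration on flat lists of voxel intensities: it assumes a linearly interpolated percentile, leaves out the bounding-box cropping, and uses hypothetical function names.

```python
import math


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty data")
    position = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] * (1.0 - fraction) + ordered[upper] * fraction


def rescale_by_percentile(voxels, q=99.5):
    """Divide intensities by the q-th percentile so that most values fall in [0, 1]."""
    scale = percentile(voxels, q)
    if scale == 0:
        raise ValueError("cannot rescale a volume whose percentile is zero")
    return [v / scale for v in voxels]


def nonzero_mean_std(volumes):
    """Mean and standard deviation over the non-zero voxels of all given volumes."""
    values = [v for volume in volumes for v in volume if v != 0]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def standardize(voxels, mean, std):
    """Shift and scale using dataset-level statistics of one modality."""
    return [(v - mean) / std for v in voxels]


if __name__ == "__main__":
    # Two tiny hypothetical "volumes" of the same modality with very different raw ranges.
    raw = [[0, 120, 300, 450, 900], [0, 2, 5, 8, 10]]
    rescaled = [rescale_by_percentile(volume) for volume in raw]
    mean, std = nonzero_mean_std(rescaled)
    print([standardize(volume, mean, std) for volume in rescaled])
```

Dividing by a high percentile rather than the maximum keeps a handful of very bright voxels from compressing the rest of the intensity range.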


3 Results 


3.1 Hyperparameters 


The SLIC parameters with the highest achievable segmentation accuracy (ASA) were 
k = 15000 and m = 0.5. The value for m differs from that in our previous work [13] 
as our preprocessing steps have slightly changed. 

The best performing GNN model from the cross-validation phase had 6 layers with 
256 neurons each and a learning rate of 0.0005. The GNN is thus deeper and has many 
more learnable parameters than the CNN. This is a purposeful design choice to force 
the GNN to do the majority of the learning. 


3.2 Evaluation Metrics 


The performance of the models submitted to the BraTS challenge is evaluated using
two metrics: the Dice score and the 95th percentile of the symmetric Hausdorff distance.
Both metrics are evaluated over the whole tumor, core tumor, and active tumor
sub-regions. Intuitively, the Dice score measures the overlap between the predictions and
the ground truth, while the Hausdorff distance measures how far the predicted and ground
truth segmentations diverge from each other.

$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN} \tag{3}$$

where TP, FP, and FN are the number of true positives, false positives, and false
negatives, respectively. True positive voxels are defined as those correctly assigned as
belonging to a specific tumor compartment.

$$\mathrm{HD95} = 95\%\left(d(\hat{Y}, Y) \,\|\, d(Y, \hat{Y})\right) \tag{4}$$

where $d$ is the element-wise distance of every voxel in the first set to the closest voxel of
the same label in the second, $\hat{Y}$ are the predicted labels of each voxel, $Y$ are the ground
truth labels of each voxel, and $\|$ is the concatenation operator.
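Both metrics can be illustrated on small sets of voxel coordinates. The sketch below follows Eqs. 3 and 4 for a single tumor compartment, assuming Euclidean distances in voxel units, a linearly interpolated percentile, and a Dice score of 1 when both sets are empty (a convention chosen here, not taken from the challenge). The brute-force nearest-neighbor search is only suitable for tiny examples and is not the official evaluation code.

```python
import math


def dice_score(predicted, truth):
    """Dice = 2TP / (2TP + FP + FN) for two sets of voxel coordinates."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    denominator = 2 * tp + fp + fn
    return 1.0 if denominator == 0 else 2 * tp / denominator


def directed_distances(source, target):
    """Euclidean distance from every voxel in source to its closest voxel in target."""
    return [min(math.dist(s, t) for t in target) for s in source]


def hd95(predicted, truth):
    """95th percentile of the pooled directed distances in both directions (Eq. 4)."""
    if not predicted or not truth:
        raise ValueError("HD95 is undefined when either segmentation is empty")
    pooled = sorted(directed_distances(predicted, truth)
                    + directed_distances(truth, predicted))
    position = (len(pooled) - 1) * 0.95
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return pooled[lower] * (1.0 - fraction) + pooled[upper] * fraction


if __name__ == "__main__":
    predicted = {(0, 0, 0), (0, 0, 1), (0, 0, 2)}
    truth = {(0, 0, 1), (0, 0, 2), (0, 0, 3)}
    print("Dice:", dice_score(predicted, truth))
    print("HD95:", hd95(predicted, truth))
```

Using the 95th percentile instead of the maximum makes the distance metric less sensitive to a few isolated misclassified voxels.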


3.3 Performance 


Table 1. Mean results on validation set.


Model | Dice WT | Dice TC | Dice ET | HD95 WT | HD95 TC | HD95 ET
GNN | 0.874 | 0.782 | 0.738 | 6.92 | 16.67 | 20.40
GNN-CNN | 0.894 | 0.807 | 0.734 | 6.79 | 12.62 | 28.20


Table 2. Median results on validation set.


Model | Dice WT | Dice TC | Dice ET | HD95 WT | HD95 TC | HD95 ET
GNN | 0.906 | 0.885 | 0.813 | 3.46 | 3.16 | 2.45
GNN-CNN | 0.925 | 0.908 | 0.842 | 3.00 | 3.00 | 2.24


The mean and median results on the validation set are given in Tables 1 and 2, respec- 
tively. On the validation set, we report both the performance of the GNN model and of 
the joint GNN-CNN model. 

The comparison of the two models shows that the addition of the convolutional 
layers to the model improves mean and median performance across both metrics in 
the whole tumor and core tumor regions, and is inconclusive for the enhancing tumor. 
In the case of ET, the CNN improves the average segmentation (better median), but 
also seems to exacerbate poor performance on outliers (worse mean). Nonetheless, the 
overall improved results indicate that the addition of the CNN can successfully correct 
misclassification errors that result from mixed-label supervoxels, even while the CNN 
architecture is very simple. Notably, the median improvement across all three subre- 
gions demonstrates that the joint GNN-CNN model is 1) better able to distinguish the 
border edema from healthy tissue, 2) better able to distinguish NET from edema, and 
3) better able to distinguish ET from NET on a typical brain. 

An example segmentation highlighting these improvements is provided in Fig. 3,
along with two of the four input modalities. The FLAIR image provides information
on the tumor core and edema and is thus well suited for the segmentation of the whole
tumor. The T1ce modality provides complementary information on NET/necrotic tissue
and the enhancing tumor and is thus vital for delineation of the ET and NET subregions.
The predictions that have been refined through the CNN (last row) are both smoother
and correspond more closely with the shape and appearance of the tumor in the two
modalities than the predictions made directly by the GNN (third row).

Fig. 3. Example Predictions on Validation Brain. Three slices (horizontal, coronal, and sagittal)
of the same brain from the validation set are shown. The first row is from the T1ce modality,
and the second is from the FLAIR modality. The third shows the GNN predictions. The fourth
row contains the GNN predictions refined through the CNN. Ground truth segmentations are
unavailable for the validation set. Red = edema, Blue = NET/necrosis, Yellow = ET. We observe
that the GNN accurately identifies the tumorous region but makes slight errors in classifying the
individual compartments. The CNN, however, can refine the predictions in greater accordance
with the images. (Color figure online)

Given its superior performance on the validation set, we chose the joint model for
evaluation on the test set. These results are provided in Table 3. The test set consists
of 570 images. Of these, 87 have a different orientation than the images in the training
and validation sets. Unfortunately, the challenge organizers informed us that our model
submission was unable to produce segmentations for these 87 images. Nonetheless, to
preserve consistency across all participants, they have been included in the aggregated
results with Dice scores of 0 and Hausdorff distances of 300.

Table 3. Results on test set.

Metric          | Dice                  | HD95
Tumor subregion | WT    | TC    | ET    | WT    | TC    | ET
Mean            | 0.747 | 0.680 | 0.560 | 63.15 | 72.63 | 75.74
Median          | 0.911 | 0.884 | 0.703 | 3.74  | 3.16  | 3.31

On the test set, the median results approach those achieved on the validation set,
but the mean scores fall far below the expected performance. We suspect that the
discrepancy between mean and median scores is caused by the inclusion of the 87 failed
cases. The existence of such outliers would skew the mean more than the median scores,
leading to the observed pattern. Nonetheless, the median results indicate that, on a
typical unseen tumor, our model is effective at locating the whole and core tumor, but has
difficulty delineating the enhancing tumor from surrounding regions. Possible
improvements to ET prediction are considered in the discussion.
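
The way a block of failed cases pulls the mean away from the median can be illustrated
with a short sketch. The counts (570 test cases, 87 failures) and the penalty values
(Dice 0, Hausdorff distance 300) come from the text; the scores assigned to the
successful cases are hypothetical placeholders, since per-case results are not reported.

```python
import statistics

N_TEST = 570    # number of test cases (from the text)
N_FAILED = 87   # cases for which no segmentation was produced (from the text)


def aggregate_with_failures(successful_scores, penalty, n_failed=N_FAILED):
    """Return (mean, median) after padding the scores with penalty values."""
    padded = list(successful_scores) + [penalty] * n_failed
    return statistics.mean(padded), statistics.median(padded)


# Hypothetical assumption: every successful case scores Dice 0.90 and HD95 4.0.
n_ok = N_TEST - N_FAILED
dice_mean, dice_median = aggregate_with_failures([0.90] * n_ok, penalty=0.0)
hd_mean, hd_median = aggregate_with_failures([4.0] * n_ok, penalty=300.0)

print(f"Dice: mean={dice_mean:.3f}, median={dice_median:.3f}")
print(f"HD95: mean={hd_mean:.2f}, median={hd_median:.2f}")
```

Under these placeholder scores, the median is unaffected by the penalties while the mean
shifts substantially, which matches the qualitative pattern seen in Table 3.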


4 Discussion 


We have presented a joint GNN-CNN network for automatic brain tumor segmenta- 
tion. The GNN can produce good segmentations on its own, but struggles to accurately 
delineate exact tumor and tumor compartment boundaries due to the coarse supervoxel 
generation step. We show that this limitation can be at least partially circumvented by 
adding convolutional layers to the end of the model to smooth out predictions. While 
it is likely that a more complex CNN could further boost performance, this work aims 
to improve the feasibility of GNNs for tumor segmentation rather than to engineer an 
optimal CNN. 

A clear direction for future work is to diagnose the failure cases of our model. In
particular, our model should be able to produce a segmentation on any volume, regardless
of orientation. It is likely that this issue is technical rather than a failure of the
model to generalize, but it is difficult to identify without access to the testing data.
Furthermore, it will be interesting to explore how segmentation of the enhancing tumor can
be improved. The enhancing tumor is typically a small region or a set of small regions,
which makes it inherently harder to accurately delineate with supervoxels. Perhaps a
hierarchical segmentation scheme or a more complex CNN will be able to improve model
performance here. It has also been demonstrated by other participants of this year’s
challenge that post-processing heuristics to remove false positive ET predictions can
have a meaningful impact on performance. Lastly, we also aim to incorporate a soft
Dice loss in future work to improve the predictions of the composite tumor regions,
rather than just the individual subtypes.
Abstract. We apply a method from Automated Machine Learning
(AutoML), namely Neural Architecture Search (NAS), to the task of 
brain tumor segmentation in MRIs for the BraTS 2021 challenge. NAS 
methods are known to be compute-intensive, so we use a continuous and 
differentiable search space in order to apply a DiNTS search for optimal 
fully convolutional architectures. Our method obtained Dice scores of 
0.9161, 0.8707 and 0.8537 for whole tumor, tumor core and enhancing 
tumor regions respectively on the test dataset, while requiring no manual 
design of the network architecture, which was found automatically from 
the provided training data. 


Keywords: BraTS · Deep Learning · AutoML · Neural Architecture Search


1 Introduction 


Gliomas remain the most common primary brain tumors in humans [1]. They 
are characterized by different levels of aggressiveness, which directly influences 
prognosis. Due to the gliomas’ heterogeneity (in terms of shape and appearance) 
manifested in multi-modal magnetic resonance imaging (MRI), their accurate 
delineation is an important yet challenging medical image analysis task. Manual
segmentation of such brain tumors is time-consuming and prone to human
errors and biases. The process also lacks reproducibility, which adversely affects
the effectiveness of patient monitoring and can ultimately lead to inefficient
prognosis and treatment.

The majority of manual segmentation issues could be resolved using
computer-aided automatic or semi-automatic methods of data processing. Recent
advances in Deep Learning (DL), mainly in convolutional neural networks
(CNNs), have allowed DL-based models to approach or even surpass
human-level performance in natural image classification [2] or microscope image
segmentation [3], provided that a sufficient amount of training data is available.

Automatic brain tumor segmentation is one of the most challenging problems
in medical image processing. Obtaining a computational model capable of
surpassing trained-human-level performance would provide valuable assistance to
clinicians and would enable a more precise, reliable, and standardized approach
to disease detection, treatment planning and monitoring.

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022

Naturally, DL-based models are well suited to the task as long as their
data-volume requirement is satisfied. The Brain Tumor Segmentation Challenge
(BraTS) provides a state-of-the-art dataset of fully annotated MRI brain scans
with corresponding segmentation masks, which is widely used throughout academia
and industry.

Having a large, high-quality dataset is only the first step for training a
high-quality model. One still needs to carefully design a network that will take
advantage of the data within the dataset and provide accurate predictions. This
task usually requires considerable experience and a trial-and-error approach, which
can be suboptimal at times. So far, the state-of-the-art models in brain tumor
segmentation are based on encoder-decoder-like architectures, with the most
prominent example being the U-Net [4]. Indeed, U-Net-like architectures, sometimes
with modifications, won the previous three challenges. In 2018, Myronenko et al.
modified a U-Net model by adding a Variational Autoencoder branch for
regularization [5]. In 2019, Jiang et al. employed a two-stage U-Net pipeline to
segment the substructures of brain tumors from coarse to fine [6]. In 2020, Isensee
et al. applied the nnU-Net framework with BraTS-specific modifications regarding
data post-processing, region-based training, data augmentation, and minor
modifications to the nnU-Net pipeline [7].

It is evident that a well-designed U-Net-based architecture performs very well
on tasks such as brain tumor segmentation. However, in most cases, an expert must
still manually design and apply the required modifications to the baseline model.
In this context, the model which won BraTS 2020, nnU-Net, represents a very
important step in the right direction. nnU-Net is a framework for training
(medical) segmentation models that is able to adapt the model architecture and
data pipeline to the given task. There are high-level rules imposed on the
framework, but the implementation of details is automated.

In this paper, we take the automated network architecture design approach
one step further by using a network design methodology called Neural
Architecture Search (NAS). NAS was proposed by Zoph et al. [8] to automatically
uncover optimal architectures contained within a given search space. NAS can be
applied to optimize an architecture on multiple levels. A standard approach is
to perform a search on the topology level, which describes the high-level
connections within the network, and on the cell level, which optimizes operations
taking place at a low level (for example in particular network layers). In
medical image segmentation, NAS was successfully applied in various approaches,
such as NAS-UNet [9] or V-NAS [10].

The downside of NAS algorithms is that they are both computationally
expensive and take a long time to provide results; for example, C2FNAS [11]
takes 333 GPU days to be trained on Medical Segmentation Decathlon [12], while
Reinforcement Learning [13] and evolutionary approaches can be even slower [14].
Moreover, traditional NAS algorithms suffer from the discretization gap problem,
which arises when a continuous representation is binarized and leads to loss of
performance. To solve this problem, FairDARTS [5] proposed a zero-one loss to
push the continuous representation close to binary.

In this paper, we exploit DiNTS [15], a novel bi-level NAS method that is
continuous, differentiable, and integrates topology constraints during training.
Being continuous and differentiable makes it possible to use gradient-based
optimizers, which are more effective than Reinforcement Learning [13] or
evolutionary methods [14]. The topology-aware training allows the architecture
to converge to a solution that is feasible (providing paths from the input to
the output) and can easily be converted to a final discrete architecture. Due to
a specifically designed topology loss, the discretization gap is largely reduced
compared to methods where the training is unaware of topology constraints.


2 Methods 


2.1 Data 


The training data provided for the BraTS challenge [16-20] is a set of brain
MRI scans along with segmentation annotations of tumor regions. For each of
the 1,251 examples, four modalities are included, which were acquired with
different clinical protocols and various scanners from multiple data-contributing
institutions. The given modalities are native (T1), post-contrast T1-weighted
(T1Gd), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery
(T2-FLAIR). The 3D volumes are skull-stripped and registered to 1 mm³ isotropic
resolution with dimensions of 240 x 240 x 155 voxels.

Segmentation labels were annotated manually by one to four experts. 
Annotations comprise the GD-enhancing tumor (ET), the peritumoral edema- 
tous/invaded tissue (ED), and the necrotic tumor core (NCR). Voxels that are 
not labeled as part of the tumor are treated as background class, as shown in 
Fig. 1. 

In order for submissions to be evaluated on an online platform, 219 additional 
validation samples without associated ground truth were also released. For the 
final test evaluation, 530 cases were kept secret by the organizers. All volumes 
are provided as NIfTI files [21]. 


2.2 Pre-processing and Data Augmentation 


The MONAI open-source framework [22] was used to load and pre-process the 
brain volumes from raw NIfTI files. The four modalities were concatenated 
together along the channels dimension. An additional binary channel was added 
to identify the brain region (voxels where any modality is non-zero). 

Non-zero intensities were normalized channel-wise so that they follow a 
N(0,1) distribution and volumes were aligned using the RAS orientation. For 
training in memory-limited environments, random crops of 128 x 128 x 128 
voxels were generated. The following data augmentations were applied to reduce 
overfitting: 
Fig. 1. Slices of a training sample with associated ground truth (panels: T2-FLAIR, T1,
Annotation). Annotation classes: background (blue), necrotic tumor core (orange),
peritumoral edematous/invaded tissue (green), GD-enhancing tumor (purple).
(Color figure online)


- random flip around each axis independently (x, y, z) with probability 0.3
- random intensity scaling of [-0.1, 0.1]
- random intensity shift of [-0.1, 0.1] (brain region only)
- random Gaussian noise of standard deviation up to 0.3 (brain region only)


When voxel interpolation was needed, bilinear interpolation was used for the inputs,
and nearest-neighbor interpolation was used for the labels.
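
The pre-processing and augmentation steps above can be sketched with NumPy as follows.
This is a minimal illustration rather than the MONAI pipeline actually used: the function
names are hypothetical, the intensity scaling is assumed to multiply by (1 + factor), and
the noise standard deviation is assumed to be drawn uniformly from [0, 0.3]. Random
cropping and interpolation are omitted.

```python
import numpy as np


def preprocess(modalities):
    """Normalize non-zero voxels per channel and append a binary brain-mask channel.

    modalities: array of shape (4, D, H, W). Returns an array of shape (5, D, H, W).
    """
    modalities = np.asarray(modalities, dtype=np.float32)
    brain = np.any(modalities != 0, axis=0)  # voxels where any modality is non-zero
    normalized = np.zeros_like(modalities)
    for c in range(modalities.shape[0]):
        nonzero = modalities[c] != 0
        if nonzero.any():
            values = modalities[c][nonzero]
            normalized[c][nonzero] = (values - values.mean()) / (values.std() + 1e-8)
    return np.concatenate([normalized, brain[None].astype(np.float32)], axis=0)


def augment(volume, label, rng):
    """Apply the listed augmentations; the last channel of volume is the brain mask."""
    n_img = volume.shape[0] - 1
    # Random flip around each spatial axis independently with probability 0.3.
    for axis in (1, 2, 3):
        if rng.random() < 0.3:
            volume = np.flip(volume, axis=axis)
            label = np.flip(label, axis=axis - 1)
    volume = np.array(volume, dtype=np.float32)  # copy, so the input is left untouched
    label = np.array(label)
    brain = volume[-1] > 0
    scale = 1.0 + rng.uniform(-0.1, 0.1)   # assumed form of the intensity scaling
    shift = rng.uniform(-0.1, 0.1)         # applied to the brain region only
    sigma = rng.uniform(0.0, 0.3)          # assumed sampling of the noise level
    noise = sigma * rng.standard_normal(volume[:n_img].shape)
    volume[:n_img] = volume[:n_img] * scale + (shift + noise) * brain[None]
    return volume, label
```

Because the shift and noise are masked by the brain channel and the background is zero
before scaling, background voxels stay at zero after augmentation.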


2.3 Differentiable Neural Network Topology Search 


During the DiNTS optimization, two aspects of the network architecture are 
searched simultaneously via gradient descent: 


1. The topology, i.e. the high-level connections between layers of various fea- 
ture scales 
2. The cells, i.e. the specific operations applied on the feature maps 


The topology search space is a multi-path fully convolutional network
containing 12 layers, each with 4 scales of feature maps (1/2, 1/4, 1/8, 1/16), as
illustrated in Fig. 2. Each feature scale is only connected to itself and to adjacent
scales (4 same-scale, 3 upsampling, and 3 downsampling connections per layer),
meaning there are in total 10 · 12 = 120 topology connections, also called cells or
edges. This search space is flexible and not restricted to U-shaped or single-path
architectures like previous NAS methods [11,23].
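
The connection count can be checked with a few lines of Python. This is only a counting
sketch; the function name and the representation of a connection as a pair of scale
indices are illustrative choices, not part of DiNTS.

```python
def count_connections(n_layers=12, n_scales=4):
    """Count topology connections when each scale links to itself and its neighbours."""
    per_layer = sum(
        1
        for src in range(n_scales)
        for dst in range(n_scales)
        if abs(src - dst) <= 1  # same scale, one scale up, or one scale down
    )
    return per_layer, per_layer * n_layers


per_layer, total = count_connections()
print(per_layer, total)
```

With 4 scales there are 4 same-scale links plus 3 links in each direction between
neighbouring scales, which gives the 10 connections per layer used in the text.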

For each cell independently, 5 operation blocks are considered, as shown in 
Fig. 2 (right): 


- skip connection
- 3D convolution (3 x 3 x 3)
- pseudo-3D convolution (3 x 3 x 1)
- pseudo-3D convolution (3 x 1 x 3)
- pseudo-3D convolution (1 x 3 x 3)


Pseudo-3D refers to the sequence of two convolutions described in [24], which
has been used in V-NAS [10]. Each operation block (except for the skip
connection) is also preceded by a ReLU non-linearity and followed by instance
normalization [25] with a learnable affine transform. Cells that map to a higher
or lower scale have an additional 2x upsampling or 2x downsampling, respectively.
Downsampling is performed with 1 x 1 x 1 convolutions with stride 2,
while upsampling is performed with trilinear scaling followed by a 1 x 1 x 1
convolution with stride 1.

Fig. 2. Architecture (topology and cell) search space for DiNTS. Input and output
stems (light blue) are fixed, while green connections are optimized during the
architecture search. Figure from [15]. (Color figure online)
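
To give a rough sense of why pseudo-3D operations are lighter than a full 3D convolution,
the sketch below compares weight counts. It assumes that a pseudo-3D block is an in-plane
3 x 3 x 1 convolution followed by the complementary 1 x 1 x 3 convolution, with equal
input and output channels and no bias terms; these details are assumptions and are not
spelled out in the text.

```python
def conv_params(kernel, c_in, c_out):
    """Number of weights in a 3D convolution without bias."""
    kd, kh, kw = kernel
    return kd * kh * kw * c_in * c_out


def pseudo3d_params(plane_kernel, c_in, c_out):
    """Weights of an assumed pseudo-3D block: plane convolution, then line convolution."""
    # The complementary kernel has size 3 along the axis where the plane kernel has size 1.
    line_kernel = tuple(3 if k == 1 else 1 for k in plane_kernel)
    return conv_params(plane_kernel, c_in, c_out) + conv_params(line_kernel, c_out, c_out)


channels = 16  # number of stem filters at scale 1 during the search (from the text)
full = conv_params((3, 3, 3), channels, channels)
p3d = pseudo3d_params((3, 3, 1), channels, channels)
print(full, p3d)
```

Under these assumptions the pseudo-3D block uses 12 weights per channel pair instead of 27.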


Architecture Search. For each fold, an architecture search was performed to 
select an optimal topology and optimal cell operations. Trainable parameters 
were split into two groups: 


- Parameters of the neural network $w_{net}$ (convolutions and instance
  normalization weights and biases)
- Parameters of the architecture $w_{arch}$ (topology weights and internal cell
  weights)


The training set (consisting of four folds) was partitioned equally into two 
subsets, train_net and train_arch. The first subset was used to train the network 
weights, while the second was used to train the architecture weights. 

During this search, the stem cell at scale 1 had 16 filters and this number 
was doubled each time the spatial size was decreased by half. All architecture 
and network weights were initialized randomly as in [15]. 

During a warm-up period, only $w_{net}$ was updated using the train_net
partition. After this warm-up, both train_net and train_arch were iterated on
simultaneously, to update both $w_{net}$ and $w_{arch}$.

The loss function used to optimize $w_{net}$ was an even mix of the cross-entropy
loss and the multi-class smoothed Sørensen-Dice loss. This is called the
segmentation loss $\mathcal{L}_{seg}$. The use of the Dice loss helps mitigate the
effect of class imbalance as shown in [10]. Following Isensee et al. [7], we trained
on the nested classes used for evaluation instead of the raw provided classes.
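
A minimal NumPy sketch of such a segmentation loss on nested regions is given below. It
is not the authors' implementation: the raw label values (1 = NCR, 2 = ED, 4 = ET) are an
assumption about the dataset encoding, the cross-entropy term is taken per region (binary)
because the nested regions overlap, and the smoothing constant is a placeholder.

```python
import numpy as np

# Assumed raw label encoding (not stated in the text): 1 = NCR, 2 = ED, 4 = ET.
NCR, ED, ET = 1, 2, 4


def to_nested_regions(label):
    """Convert a raw label map into three overlapping binary maps: ET, TC, WT."""
    et = label == ET
    tc = et | (label == NCR)   # tumor core = ET + NCR
    wt = tc | (label == ED)    # whole tumor = ET + ED + NCR
    return np.stack([et, tc, wt]).astype(np.float32)


def soft_dice_loss(probs, target, smooth=1.0):
    """Smoothed soft Dice loss, averaged over the region channels."""
    axes = tuple(range(1, probs.ndim))
    intersection = (probs * target).sum(axis=axes)
    denominator = probs.sum(axis=axes) + target.sum(axis=axes)
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return float(1.0 - dice.mean())


def binary_cross_entropy(probs, target, eps=1e-7):
    """Per-voxel binary cross-entropy, averaged over all voxels and regions."""
    p = np.clip(np.asarray(probs, dtype=np.float64), eps, 1.0 - eps)
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean())


def segmentation_loss(probs, target):
    """Even mix of the cross-entropy and soft Dice terms."""
    return 0.5 * binary_cross_entropy(probs, target) + 0.5 * soft_dice_loss(probs, target)
```

Here `probs` is expected to hold per-region probabilities with the same shape as the
output of `to_nested_regions`.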

$w_{arch}$ was optimized using a loss function that integrates, in addition to
$\mathcal{L}_{seg}$, a topology loss $\mathcal{L}_{tp}$, as well as losses
$\mathcal{L}_{\alpha}$ and $\mathcal{L}_{\beta}$ to encourage the binarization of
the architecture weights, as introduced in [15].

The architecture loss function is then

$$\mathcal{L}_{arch} = \mathcal{L}_{seg} + \frac{t}{t_{all}} \left(\mathcal{L}_{tp} + \mathcal{L}_{\alpha} + \mathcal{L}_{\beta}\right),$$

where $t / t_{all}$ represents the progress of the architecture search, so that the
weight given to the topology losses is linearly increased with time.
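
The linear ramp on the auxiliary losses can be written as a one-line function. The
argument names are hypothetical, the two binarization losses are passed as plain numbers,
and it is an assumption that the step counter starts at zero at the beginning of the
search.

```python
def architecture_loss(l_seg, l_topology, l_bin_a, l_bin_b, step, total_steps):
    """Segmentation loss plus auxiliary losses weighted by the search progress."""
    progress = step / total_steps  # grows linearly from 0 to 1 during the search
    return l_seg + progress * (l_topology + l_bin_a + l_bin_b)


# Hypothetical loss values, evaluated at the start, middle, and end of a 30k-step search.
for step in (0, 15_000, 30_000):
    print(step, architecture_loss(0.4, 0.2, 0.1, 0.1, step, 30_000))
```

Early in the search the architecture weights are driven almost entirely by the
segmentation loss, and the topology and binarization terms gain influence over time.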
Pruning. Once architecture weights $w_{arch}$ are found, a discretization step
converts the continuous weights to binary ones, selecting a topology that prunes
paths of low importance. This pruning must be done carefully, so as not to create
infeasible paths, i.e. paths where a node has an input but no output, or has an
output but no input.

This step is performed by maximum likelihood estimation together with
Dijkstra’s algorithm, as described in [15].

For selecting operations of cells, the operation with the largest weight is 
picked and the others are discarded. 

Because the training was aware of pruning constraints via the topology loss, 
the discretization gap, i.e. the accuracy difference between the continuous and 
the discrete architecture, is reduced. 
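
Two of the ideas above, picking the strongest operation per cell and checking that a
pruned topology has no dangling nodes, can be sketched as follows. This is not the
maximum likelihood and Dijkstra procedure of [15]; the edge encoding
(layer, source scale, destination scale) and all names are hypothetical.

```python
OPERATIONS = ["skip", "conv_3x3x3", "p3d_3x3x1", "p3d_3x1x3", "p3d_1x3x3"]


def select_operations(cell_weights):
    """For each cell, keep the operation with the largest weight."""
    selected = {}
    for edge, weights in cell_weights.items():
        best = max(range(len(weights)), key=lambda i: weights[i])
        selected[edge] = OPERATIONS[best]
    return selected


def is_feasible(kept_edges, n_layers):
    """Check that no interior node has an input without an output, or vice versa.

    An edge (layer, src, dst) connects node (layer, src) to node (layer + 1, dst).
    Nodes in layer 0 and layer n_layers are attached to the fixed stems and are exempt.
    """
    has_output = {(layer, src) for layer, src, _ in kept_edges}
    has_input = {(layer + 1, dst) for layer, _, dst in kept_edges}
    dangling = {node for node in has_input ^ has_output if 0 < node[0] < n_layers}
    return len(kept_edges) > 0 and not dangling


weights = {(0, 0, 0): [0.1, 0.6, 0.1, 0.1, 0.1], (1, 0, 1): [0.5, 0.1, 0.2, 0.1, 0.1]}
print(select_operations(weights))
print(is_feasible([(0, 0, 0), (1, 0, 1), (2, 1, 1)], n_layers=3))
```

A check of this kind only detects infeasible topologies; choosing which edges to keep is
the part handled by the procedure cited in the text.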


Retraining. For each fold, the selected discrete architecture was retrained from
scratch on the remaining folds (train_net and train_arch together), this time only
updating network weights $w_{net}$ using the $\mathcal{L}_{seg}$ loss. The number of
channels was also increased compared to the architecture search step (32 for the
stem at scale 1).

An initial warm-up period was used to raise the learning rate up to the selected
value. During the rest of the training, the learning rate was decayed with a step
schedule.
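
A schedule of this shape might look like the sketch below. The linear form of the warm-up,
the decay interval, and the decay factor are all assumptions chosen for illustration; the
text only states that a warm-up is followed by a step schedule.

```python
def learning_rate(step, base_lr, warmup_steps=1000, decay_every=10000, gamma=0.5):
    """Linear warm-up to base_lr, then multiply by gamma every decay_every steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    n_decays = (step - warmup_steps) // decay_every
    return base_lr * gamma ** n_decays


for step in (0, 999, 1000, 11_000, 21_000):
    print(step, learning_rate(step, base_lr=1e-3))
```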


Ensemble. Final predictions on the test and validation sets are obtained by 
combining the predictions of the five retrained models in order to reduce variance. 

Each model predicts a probability map using sliding window inference, where
overlapping windows are blended using a Gaussian kernel giving more weight
to the center of the window. The probability predictions of the models are then
averaged, and the class with the highest probability is picked for each voxel.
Test-time augmentation was also used, where predictions for the 8 possible volume
flips were averaged. The resulting segmentation map is then saved in the NIfTI
format with the same alignment as the input volumes.
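
The flip-based test-time augmentation and the model averaging can be sketched with NumPy
as below. Each model is represented by a hypothetical callable that maps a (C, D, H, W)
volume to class probabilities of shape (K, D, H, W); sliding-window inference with
Gaussian blending is assumed to happen inside that callable and is not shown.

```python
import itertools

import numpy as np


def flip_tta(predict, volume):
    """Average predictions over the 8 combinations of flips along the spatial axes."""
    spatial_axes = (1, 2, 3)
    total, count = None, 0
    for r in range(len(spatial_axes) + 1):
        for axes in itertools.combinations(spatial_axes, r):
            flipped = np.flip(volume, axis=axes) if axes else volume
            probs = predict(flipped)
            # Undo the flip so that all predictions are aligned with the input volume.
            probs = np.flip(probs, axis=axes) if axes else probs
            total = probs if total is None else total + probs
            count += 1
    return total / count


def ensemble_predict(models, volume):
    """Average the probabilities of all models and pick the most likely class per voxel."""
    mean_probs = np.mean([flip_tta(model, volume) for model in models], axis=0)
    return mean_probs.argmax(axis=0)
```

For a model that is exactly flip-equivariant, the averaged output equals the plain
prediction; in practice the averaging smooths out orientation-dependent errors.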


2.4 Experimental Setup 


The training dataset was split into 5 folds so that 5 models could be trained on 
4 folds each, and evaluated on the remaining fold. Each fold contained either 
250 or 251 examples, so each model was trained on around 1,000 examples. 
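
The fold sizes follow directly from splitting 1,251 examples into 5 parts, as the sketch
below shows. The shuffling seed and function names are illustrative, and the actual
assignment of cases to folds is not described in the text.

```python
import random


def make_folds(n_examples=1251, n_folds=5, seed=0):
    """Shuffle example indices and split them into folds whose sizes differ by at most 1."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_examples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def cross_validation_splits(folds):
    """Yield (training indices, held-out indices) for each fold."""
    for held_out in range(len(folds)):
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, folds[held_out]


folds = make_folds()
print([len(fold) for fold in folds])
print([len(train) for train, _ in cross_validation_splits(folds)])
```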

A PyTorch [26] implementation of DiNTS was used. Training and inference
were performed inside the NVIDIA NGC PyTorch 21.07 Docker container, allowing
for the full encapsulation of dependencies, reproducible runs, as well as easy
deployment on any system. Training and search runs were performed using Mixed
Precision [27] in order to speed up the model and save memory. The architecture
search and retraining were performed on an NVIDIA DGX-1V (8x V100 32 GB)
system.
Hyper-parameters for the architecture search include a batch size of 1 per
GPU (8 in total), the Adam optimizer for both $w_{arch}$ and $w_{net}$, with a learning
rate of $8 \times 10^{-4}$, weight decay of $4 \times 10^{-5}$ and betas of (0.9, 0.999)
for $w_{net}$, and a learning rate of $1 \times 10^{-3}$, no weight decay, and betas of
(0.5, 0.999) for $w_{arch}$. Architecture search warm-up lasted 10k steps, and
retraining warm-up lasted 1k steps. A full search took 30k steps, while a retraining
was 31k steps (warm-up included for both).

For each fold, the architecture search took around 210 GPU-hours on a
system down-clocked to 160 W per GPU, and used 26 GB of GPU memory.
Retraining took 140 GPU-hours on the down-clocked system, and used 23 GB of GPU
memory. In total, 5 · (210 + 140) = 1,750 GPU-hours were spent in the full
pipeline, excluding the prediction on validation samples.


3 Results 


3.1 Selected Topologies 


The topologies resulting from the architecture searches are multi-path and
dense, using a mixture of all five proposed cell operations. The memory loss,
which DiNTS can add to the topology loss in order to encourage light networks,
was not used here in order to maximize the model capacity and accuracy.

An example of such a topology is illustrated in Fig. 3. We can observe two
horizontal pathways through the maximum feature scale (1/2) and the minimum
scale (1/16). Around 25% of the cells were pruned during the discretization step.


Fig. 3. Example of a selected architecture after optimization (fold 0). Numbers on the
edges represent the type of cell (0: skip, 1: 3 x 3 x 3 conv, 2: 3 x 3 x 1 P3D,
3: 3 x 1 x 3 P3D, 4: 1 x 3 x 3 P3D).


3.2 Quantitative Results 


Table 1 shows the best results obtained by the DiNTS method on the test data 
held by the BraTS organizers, as well as the cross-validation on the training 
data. 


3.3 Qualitative Results 


Score summaries like the Dice and Hausdorff distance provide a good way to
compare models for a challenge, but in order to understand more precisely the
strengths and weaknesses of predictors, manual inspection of predictions is
sometimes necessary.

Table 1. Scores obtained on the test data (via the Synapse platform) and on our
cross-validation (CV) folds. ET stands for enhancing tumor, TC stands for tumor core
(ET+NCR), and WT stands for whole tumor (ET+ED+NCR).

Dice           | ET     | TC     | WT
Our submission | 0.8537 | 0.8707 | 0.9161
CV (average)   | 0.8480 | 0.8856 | 0.9048
CV (fold 0)    | 0.8435 | 0.8867 | 0.9070
CV (fold 1)    | 0.8508 | 0.8788 | 0.9006
CV (fold 2)    | 0.8448 | 0.8897 | 0.8986
CV (fold 3)    | 0.8552 | 0.8886 | 0.9076
CV (fold 4)    | 0.8460 | 0.8844 | 0.9101

Figure 4 shows an example where the DiNTS method successfully captured 
all relevant tumor regions identifiable from the four input modalities. 

Figure 5, however, shows a case where DiNTS failed to accurately predict the
whole tumor region. It over-segmented seemingly healthy tissue as part of the
peritumoral edematous/invaded tissue (top left of the tumor).


Fig. 4. Example of an accurate prediction (right) overlaid on the FLAIR modality
(validation dataset).

Fig. 5. Example of a problematic prediction (right) overlaid on the FLAIR modality
(validation dataset).


4 Discussion 


While the DiNTS method managed to get reasonable results with no manual 
tuning of the network architecture and very little hyperparameter search, it still 
suffers from some drawbacks that will need to be addressed in future research. 

First of all, the selected architectures can be obscure, especially when they are 
dense like in our case. It is hard to explain why the algorithm chose specific cells 
and operations instead of others. There is no clear encoder-decoder architecture 
like we see in manually created topologies, and the pattern of cell operation 
seems arbitrary. Existing work [28] combines NAS methods with network design 
to yield efficient and performant architectures. 

Second, although DiNTS and differentiable NAS provide major improvements
in this respect, NAS methods are still relatively expensive to train. One must have
access to multi-GPU systems in order for the training time to be reasonable.
The architecture search for each fold takes around 9 GPU-days to compute, which can
be quite expensive using existing cloud platforms. This would mean spending around
800 USD per search using AWS, and 550 USD per search using Google Cloud (on-demand
pricing for NVIDIA DGX-1V 32 GB). This can prevent researchers with limited
resources from applying these techniques effectively. Encouragingly, cloud computing
prices currently show a downward trend as the hardware becomes more available.


5 Conclusion 


This paper presented our participation in the BraTS 2021 challenge. We explored
the use of Automated Machine Learning (AutoML) using an efficient differentiable
Neural Architecture Search to segment tumor regions out of brain MRI
scans. No manual tuning of the network architecture was needed, limiting human
biases and labour.

The search for an optimal architecture required a low amount of researcher-hours,
while still using a significant amount of GPU-hours. In the spirit of nnU-Net,
this work is a step towards a fully-automated system that would be able
to perform well on any input dataset that is presented to it, underpinning the
democratization of Deep Learning on medical data.
Abstract. Gliomas are the most common type of primary brain tumor,
and high-grade gliomas are typically treated using a combination of 
chemotherapy, radiation therapy, and surgical excision. For the latter 
two therapy options, precise knowledge about the location of the tumor 
and its components is required, which can be obtained using MRI scans. 
Manually labeling the tumor area in those 3-dimensional images is a 
tedious and time-consuming task, hence major efforts have been made to 
provide automated segmentation. We present our solution to the BraTS 
2021 challenge Task1, where we segment gliomas in MRI scans using a 
SegNet-based approach, achieving competitive and stable performance 
across tumor types and components. Compared to previous solutions 
using UNet architectures, our model achieves improved segmentation of 
the peritumoral edema and comparable performance for the other classes 
while reducing the number of parameters. 


Keywords: Brain tumors · Deep learning · Segmentation


1 Introduction 


1.1 Gliomas 


Gliomas are tumors arising from supporting cells of the brain and represent the 
most common form of primary brain tumors. Several distinct entities can be distinguished
based on their cells of origin and their malignancy, with the majority arising
from astrocytes (astrocytomas). Among these, glioblastomas (GBMs), also known 
as grade IV astrocytomas, are the most common brain tumors while being associ- 
ated with the worst clinical outcome. Without treatment, the median survival of 
patients is as low as three months, which can be extended to 15 months through 
combined treatment with chemotherapeutic agents, radiation, and surgery [10]. 
Due to the extensive capacity of GBMs to invade the surrounding healthy tis- 
sue, as well as the need to destroy as many cancer cells as possible while leaving 
healthy brain tissue intact, it is vital to obtain a precise localization of the tumor 
and its sub-regions [18]. At the core of the tumor, there is typically a necrotic zone, 
caused by nutrient and oxygen starvation in fast-growing tumors (here abbreviated
by NCR) [16]. This zone is surrounded by living and proliferating tumor cells,
the enhancing tumor (ET). As GBMs compromise the integrity of the blood-brain
barrier, the tumor is surrounded by an edema, caused by extravasation of fluid
from leaky blood vessels in the tumor’s vicinity (WT) [19]. These different tumor
compartments can be distinguished by medical imaging. The gold standard for
this is multi-modal magnetic resonance imaging (MRI), a technique which
delivers images highlighting different structures within soft tissues.


1.2 Segmentation 



Fig. 1. Four different modalities of MRI scans used for the segmentation. Top row: 
modalities alone. Bottom row: superimposed ground truth for segmentation with the 
peritumoral edema shown in blue, enhancing tumor in red, and necrotic tumor core in 
green. (Color figure online) 


Segmentation refers to the task of identifying regions in an image that belong 
to a certain class, e.g. tumor, healthy tissue, and background. In contrast to 
object detection, no attempt is made to separate bordering areas belonging to 
the same class but different entities thereof. Thus, the typical output of a segmen- 
tation is a so-called segmentation map, a tensor of the same size as the original 
image in terms of spatial dimensions, but with one or several channels indicat- 
ing the presence of certain mutually exclusive or potentially overlapping fea- 
tures, respectively [20]. Due to this relationship of the input and output images, 
UNet architectures have been exceedingly successful [13]. These convolutional 
neural networks correspond to an autoencoder-like structure with skip connections
between the corresponding encoder and decoder blocks, concatenating the
encoder feature maps to the decoder ones while restoring the image size through
transposed convolutions. The skip connections thus enable the preservation of spatial
information [17]. Because max-pooling on the encoder is reversed using transposed
convolutions, some localization information about maxima is lost during
the process. Furthermore, transposed convolutions need to be learned, adding
parameters to the model. SegNet architectures aim to alleviate this problem by
using unpooling layers instead of transposed convolutions. In these models, the
indices of the maxima identified during the max pooling are retained and passed
to the unpooling layers, preserving the localization of the maxima and
improving the resolution of the segmentation map [3].

For the past ten years, the annual Brain Tumor Segmentation Challenge 
(BraTS) has addressed the task of segmenting brain tumors and their sub- 
structures from MRI scans, reflecting the advances in the field of (medical) 
image processing during that time and providing large, well-annotated data sets 
for researchers to use. 


2 Methods 


2.1 Data Sources 


Data used in this publication were obtained as part of the RSNA-ASNR-MICCAI 
Brain Tumor Segmentation (BraTS) Challenge project through Synapse ID 
(syn25829067) [4-7,14]. 3D NIfTI images of size 155 x 240 x 240 (Depth x 
Height x Width) with one channel for each of the four modalities were used as 
input to the model, while training labels were provided as single-channel NIfTI 
with integer class labels. Example images are shown in Fig. 1. 


2.2 Preprocessing 


The data obtained were already skull-stripped, scaled, and cropped as described
previously. Our implementation of a SegNet was based on the nnUNet framework,
so the additional preprocessing corresponded to that described in [13].
Importantly, this includes the cropping of the original input size of 155 x 240 x 240
to 128 x 128 x 128 (see Fig. 2).


2.3 Network Architecture 


Table 1. Parameter counts for nnUNet and nnSegNet model architectures.

                     | nnUNet     | nnSegNet
Number of parameters | 31,198,176 | 27,663,648

Fig. 2. Depiction of the UNet and SegNet architectures used. Both network types are
largely equivalent, with the exception of transposed convolutions in the decoder of the
UNet being replaced by unpooling layers in the SegNet (green arrows). (Color figure
online)


The proposed nnSegNet architecture is designed to be an efficient deep convo- 
lutional neural network for pixel-wise semantic segmentation. It is based on the 
nnUNet framework [13]. 

The encoder topology of the nnSegNet consists of five max pooling operations
with a kernel size of 2 x 2 x 2 and a stride of 2 x 2 x 2. The indices of the max
pooling operation are stored and later used in the unpooling operation. The
unpooling operation computes a partial inverse of the max pooling operation and
therefore allows the feature map size of the corresponding encoder step to be
recreated. For the unpooling operation we used a kernel size of 2 x 2 x 2 and a
stride of 2 x 2 x 2. The feature map of the corresponding encoder step and the
unpooled feature map are concatenated in the decoder part of the network. For a
visual comparison of both architectures, see Fig. 2.
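
The index-passing mechanism described above can be illustrated with a minimal one-dimensional sketch in plain Python. This is not the authors' implementation: the function names and the example signal are hypothetical, and the 3D case works the same way along each axis (in PyTorch, for example, `MaxPool3d` with `return_indices=True` paired with `MaxUnpool3d`).

```python
def max_pool_with_indices(values, kernel=2):
    """Non-overlapping 1D max pooling that also records where each maximum was."""
    pooled, indices = [], []
    for start in range(0, len(values) - kernel + 1, kernel):
        window = values[start:start + kernel]
        # Position of the maximum inside the window (first one on ties).
        offset = max(range(kernel), key=window.__getitem__)
        pooled.append(window[offset])
        indices.append(start + offset)
    return pooled, indices


def max_unpool(pooled, indices, output_size):
    """Partial inverse: put each pooled value back at its stored index, zeros elsewhere."""
    restored = [0.0] * output_size
    for value, index in zip(pooled, indices):
        restored[index] = value
    return restored


# Hypothetical feature row, pooled with kernel size 2 and stride 2.
signal = [0.1, 0.9, 0.4, 0.3, 0.7, 0.2, 0.0, 0.5]
pooled, indices = max_pool_with_indices(signal)
restored = max_unpool(pooled, indices, len(signal))
```

The unpooled row has the original length and every maximum sits at its original position, which is the localization property the SegNet decoder relies on. No weights are learned in this step, in contrast to a transposed convolution.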


2.4 Training 


For training, we used the hyperparameters automatically determined by the
nnUNet framework [13]. Every network architecture was trained for 1,000 epochs
with 250 iterations for each epoch. The training was performed on an NVIDIA
A100 40 GB GPU and took 37.5 h. The initial learning rate was set to 0.01 and a
polynomial learning rate decay was used. As an optimizer we used stochastic
gradient descent with a momentum of 0.99 and a weight decay of 3 x 10^-5. To
train the model we used deep supervision on auxiliary outputs at different
depths of the network (see Fig. 2).
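
As a rough illustration of a polynomial learning rate decay, the sketch below computes the rate per epoch for the stated initial value of 0.01 and 1,000 epochs. The exponent of 0.9 is an assumption made for this example (the text does not state it), and `poly_lr` is a hypothetical helper name.

```python
def poly_lr(epoch, max_epochs=1000, initial_lr=0.01, exponent=0.9):
    """Polynomial decay from initial_lr at epoch 0 towards 0 at max_epochs.

    The exponent is an assumed value, not taken from the text.
    """
    return initial_lr * (1 - epoch / max_epochs) ** exponent


# Learning rate at the start, in the middle and at the last epoch of training.
schedule = [poly_lr(epoch) for epoch in (0, 500, 999)]
```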

As a loss function we used the sum of the Dice loss and the binary cross
entropy loss on all auxiliary outputs. The final loss was computed as the weighted
sum of all auxiliary losses.
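
The weighted sum over auxiliary outputs could look like the following sketch. The weighting scheme (halving the weight at each coarser output and normalizing the weights to sum to one) is an assumption for illustration, since the text does not give the weights, and `deep_supervision_loss` is a hypothetical name.

```python
def deep_supervision_loss(losses_per_output):
    """Weighted sum of per-output losses, ordered from full resolution to coarsest.

    Assumed weighting: 1, 1/2, 1/4, ... normalised so that the weights sum to 1.
    """
    raw = [0.5 ** level for level in range(len(losses_per_output))]
    total = sum(raw)
    weights = [w / total for w in raw]
    return sum(w * loss for w, loss in zip(weights, losses_per_output))


# Hypothetical Dice + BCE losses of three outputs at decreasing resolution.
combined = deep_supervision_loss([0.20, 0.30, 0.50])
```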


2.5 Postprocessing 


To obtain the final prediction we used the computed softmax for every class in a
sequential pattern. First, the output mask for the WT class was generated with
a given threshold. Second, the output mask for the NCR class was generated
with a given threshold and the previous mask was overwritten. Finally, this step
was repeated for the ET class.
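
The sequential overwrite can be sketched per voxel as below. The integer label ids, the flat-list representation of the activations and the function name are assumptions made for this example only.

```python
# Label ids assumed for illustration (background, necrotic core, WT-only, ET).
BACKGROUND, NCR, WT_ONLY, ET = 0, 1, 2, 4


def assign_labels(p_wt, p_ncr, p_et, t_wt=0.5, t_ncr=0.5, t_et=0.5):
    """Threshold WT first, then overwrite with NCR, then overwrite with ET."""
    labels = []
    for wt, ncr, et in zip(p_wt, p_ncr, p_et):
        label = BACKGROUND
        if wt > t_wt:
            label = WT_ONLY
        if ncr > t_ncr:
            label = NCR  # overwrites the WT mask
        if et > t_et:
            label = ET   # overwrites both previous masks
        labels.append(label)
    return labels


# Hypothetical activations for four voxels.
p_wt = [0.9, 0.8, 0.6, 0.2]
p_ncr = [0.1, 0.7, 0.2, 0.1]
p_et = [0.1, 0.2, 0.8, 0.6]
default_labels = assign_labels(p_wt, p_ncr, p_et)
stricter_ncr = assign_labels(p_wt, p_ncr, p_et, t_ncr=0.75)
```

Because later classes overwrite earlier ones, the per-class thresholds interact, which is why they are tuned jointly rather than one at a time.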

During training the thresholds for all classes were set to 0.5. To improve the
final model performance, we optimized the softmax thresholds for every class
(see Fig. 4) on a 5-fold cross validation. For every fold we sampled 1000 different
threshold combinations. Due to the interdependence of the different classes and
the resulting large search space, we decided to use a Tree-structured Parzen
Estimator Approach (TPE) to optimize the thresholds [9]. Finally, we used the
median thresholds for every class from all five folds. This optimization was done
separately for the nnSegNet, the nnUNet and their ensemble. We used Optuna [2]
as the optimization framework.

Our second postprocessing step aims to reduce false positive ET predictions.
To this end, we used two thresholds on the predicted ET volume and the
predicted NCR volume. These thresholds were computed with a decision tree
algorithm with a depth of two. Finally, the ET label was suppressed in the final
prediction and replaced by the next highest softmax value if the predicted ET
volume was < 129.5 and the NCR volume was > 16954.0 (see Fig. 5).
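
The decision rule stated above can be written as a short sketch. The two cut-off values are taken from the text; treating the volumes as voxel counts, the class names in the dictionary and the helper names are assumptions for illustration.

```python
ET_MAX_VOLUME = 129.5     # from the text; unit assumed to be voxels
NCR_MIN_VOLUME = 16954.0  # from the text; unit assumed to be voxels


def should_suppress_et(et_volume, ncr_volume):
    """True if the case matches the 'small ET, large NCR' pattern."""
    return et_volume < ET_MAX_VOLUME and ncr_volume > NCR_MIN_VOLUME


def next_best_class(softmax_scores, suppressed="ET"):
    """Class with the highest softmax value once the suppressed class is removed."""
    remaining = {name: s for name, s in softmax_scores.items() if name != suppressed}
    return max(remaining, key=remaining.get)


# Hypothetical voxel whose most likely class is ET.
voxel_scores = {"background": 0.05, "NCR": 0.35, "edema": 0.15, "ET": 0.45}
replacement = next_best_class(voxel_scores)
```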


2.6 Metrics 


Several metrics were used in the evaluation, specifically the Dice Coefficient [11], 
sensitivity, specificity, and Hausdorff95 (HD95) distance [15]. The first three 
are calculated from overlap of sets, with TP indicating true positive, FP false 
positive, TN true negative, and FN false negative: 


\[ \mathrm{Dice} = \frac{2\,TP}{2\,TP + FP + FN} \tag{1} \]

\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{2} \]

\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \tag{3} \]

The Hausdorff95 distance is defined by the supremum sup of the distance d
between two sets X and Y, made more robust to outliers by reporting the 95th 
percentile rather than the maximum: 


\[ d_{H95} = P_{95}\left\{ \sup_{x \in X} d(x, Y),\ \sup_{y \in Y} d(X, y) \right\} \tag{4} \]
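
The three overlap-based metrics follow directly from the confusion counts. The sketch below is a plain-Python illustration with hypothetical counts; it does not handle the degenerate case of empty masks, where the denominators become zero.

```python
def overlap_metrics(tp, fp, tn, fn):
    """Dice, sensitivity and specificity from voxel-wise confusion counts."""
    dice = 2 * tp / (2 * tp + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return dice, sensitivity, specificity


# Hypothetical counts for one class in one image.
dice, sensitivity, specificity = overlap_metrics(tp=80, fp=10, tn=900, fn=10)
```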


3 Results 


3.1 Network Architecture 


To gauge the performance of existing model architectures, we first trained a
UNet using the nnUNet architecture, a frontrunner in previous segmentation
challenges. On the training dataset with 5-fold cross-validation, this network
already outperformed the top solutions of past years, likely due to the increased
size of the training dataset (see Fig. 3 and [1,8]).


Fig. 3. Performance metrics for different model architectures. Per-image metrics are
shown for images from the training dataset, using 5-fold cross-validation. Ensemble 
denotes a combination of nnSegNet activations for the peritumoral edema and nnUNet 
activations for the other classes. Horizontal lines indicate the median, boxes depict the 
inter-quartile range (IQR), whiskers extend to 1.5x the IQR. 


Because the predicted segmentation masks were, with relatively few exceptions,
very close to the ground truth labels, we hypothesized that performance gains
could be achieved via smaller adjustments to the network and postprocessing
rather than through rewriting the entire architecture. Specifically, we aimed to
increase the resolution of the predicted masks while reducing the risk of
overfitting. To this end, we replaced the transposed convolutions in the decoder
part of the network with unpooling layers, leading to a SegNet architecture (here
called nnSegNet due to its integration into the nnUNet framework), which should
result in a better conservation of location, as shown in [3]. Simultaneously, this
leads to a reduction of the number of parameters in the model by roughly 11% (see
Table 1). This
resulted in a slight stabilization of the metrics characterized by a lower standard 
deviation and inter-quartile range of the individual sample scores (see Fig. 3). 
On the public validation dataset, the performance of the nnSegNet was slightly 
decreased for the tumor core and enhancing tumor classes, but increased for the 
peritumoral edema (see Table 2), with mean scores slightly favoring the nnUNet 
architecture. An ensemble method combining the predictions of the nnUNet for 
TC and ET and the nnSegNet for WT did not appear to achieve an overall 
increase in performance on the training set with cross-validation (see Fig. 3). 


Fig. 4. Threshold optimization. Results of the parameter tuning of thresholds for
assigning classes from the softmax activations of the model output. Thresholds are 
shown on the horizontal axis, Dice Coefficients on the vertical axis. Grey levels indi- 
cate the density of trials where a corresponding score was achieved. Top row: UNet 
architecture. Middle row: SegNet architecture. Bottom row: Ensemble with WT from 
nnSegNet and the other classes predicted from the nnUNet activations. 


3.2 Postprocessing


To further improve the performance, we decided to fine-tune the class assign- 
ment. As class labels were non-overlapping, it would be possible to use the 
argmax along the output channels. However, as the classes are nested and pre- 
dicted with different sensitivities, we took a different approach, sequentially pre- 
dicting each class to be present if the softmax surpassed a certain threshold (see 
Sect. 2.5) [13]. To improve the performance of our model, we decided to tune 
the threshold for each class, revealing that the previously used threshold of 0.5 
is not ideal for all classes (see Fig. 4). For ensembles of nnSegNet and nnUNet, 
this approach was used with the softmax activations for TC and ET from the 
nnUNet and WT from the nnSegNet, but this did not lead to a major improvement
(see Figs. 3 and 4). Fig. 5 gives an overview of the postprocessing pipeline.


[Fig. 5 diagram. Training: trained nnSegNet, thresholds WT = 0.5, ET = 0.5,
NC = 0.5. Postprocessing I: TPE algorithm, optimized thresholds WT = 0.47,
ET = 0.49, NC = 0.67. Postprocessing II: decision tree optimization, keep ET.]


Fig. 5. Postprocessing steps of the nnSegNet added to the nnUNet postprocessing.
After training, the first postprocessing step is to optimize the prediction thresholds
using the TPE algorithm. The second postprocessing step is to apply a decision tree
and, where indicated, drop the ET label and replace it with the next most likely
label based on the predicted softmax.


3.3 Missing Classes 


For the ET class, we observed a drastic difference of the mean HD95 between 
the training and validation sets. Upon closer inspection, we found this to be 
caused by several outliers where an ET area was predicted but was absent from
the ground truth annotation. In the metrics calculation for the BraTS challenge,
these cases were scored with an HD95 of over 370, which strongly skews the mean,
given that the mean HD95 for correctly predicted images was below 10.
To address the erroneous prediction of ET voxels in images where no enhancing 
tumor was present, we made use of the observation that these images typically 
had very few voxels predicted to be ET, allowing us to set a threshold below 
which we could reassign the voxels to one of the other two tumor classes or 
background. 


Table 2. Performance metrics for all classes for various model architectures. pp
indicates models subjected to threshold tuning and removal of small ET predictions.


Model | Dice WT | Dice ET | Dice TC | HD95 WT | HD95 ET | HD95 TC
nnUNet | 0.926 | 0.816 | 0.885 | 3.793 | 22.873 | 7.425
nnSegNet | 0.928 | 0.808 | 0.877 | 3.483 | 26.232 | 7.571
Ensemble | 0.926 | 0.808 | 0.885 | 3.792 | 26.231 | 7.406
nnUNet pp | 0.926 | 0.848 | 0.886 | 3.783 | 9.286 | 5.813
nnSegNet pp | 0.928 | 0.847 | 0.878 | 3.470 | 9.335 | 7.564
Ensemble pp | 0.926 | 0.843 | 0.887 | 3.783 | 9.432 | 5.789


Still, overall scores were strongly influenced by poorly performing outliers, as 
can be seen from the discrepancy of mean and median score values of the nnSegNet 
(see Table 3). Similar effects were observed for the other model architectures. 


Table 3. Mean vs. median performance metrics for the nnSegNet.


Statistic | Dice WT | Dice ET | Dice TC | HD95 WT | HD95 ET | HD95 TC
Mean | 0.928 | 0.847 | 0.878 | 3.470 | 9.335 | 7.564
Median | 0.948 | 0.903 | 0.944 | 2.236 | 1.414 | 1.732


4 Discussion 


4.1 Performance 


While the models presented achieve a markedly increased performance com- 
pared to previous top competitors, this is in large part due to the increase in 
samples (660 in 2020, 2,000 in 2021), highlighting the importance of large, well- 
annotated datasets for machine learning. The nnUNet without the proposed 
post-processing, which is identical to an architecture used for BraTS 2020 [12], 
already performed exceedingly well, with minor performance gains through the 
post-processing for the WT and TC classes. The nnSegNet architecture did not 
improve overall results, but achieved a comparable performance with only 88% 
of the parameters of the nnUNet. 

Scores below a Dice Coefficient of 0.9 or above an HD95 distance of 5 were
mostly attributable to outliers, indicating that the general performance of the
models is exceedingly good. This is also reflected in the median scores of the
nnSegNet, and is especially apparent in the HD95 there. The median HD95 values
are around 2, which likely falls within the range of disagreement between
human specialists. Penalizing the incorrect presence of even a single voxel of e.g.
ET in an image where it is absent with a distance of over 370 gives these outlier
cases an outsized influence. Since this is the largest possibility for improvement
of the segmentation metrics, addressing these outliers above all else is the most
promising way to improve the performance in future challenges. To improve the 
usability in a real-world setting, however, a closer look at common misclassifica- 
tion of e.g. anatomical structures is needed. This is currently disincentivized in 
challenges, but could be achieved by penalizing the classification of areas such 
as the choroid plexus as part of the tumor. 

Previous BraTS challenges already focused on tasks such as uncertainty eval- 
uation, which are a promising way forward in terms of real-world usability. While 
most images could be segmented with near-perfect accuracy, it would be ben- 
eficial to direct a human supervisor towards the cases where performance was 
likely poor, both on a per-image and a per-region basis. Given the exceedingly 
large dataset provided and the outstanding performance of models submitted 
by the contestants, it would certainly be of interest to revisit the uncertainty 
challenge again. 


4.2 Further Model Size Reduction 


Typically CNNs require large quantities of memory and processing time to be 
deployed successfully. This often makes CNNs difficult to use in real life applica- 
tions. To tackle this issue we explored the influence of the number of parameters 
on the model performance. 

The presented nnSegNet already reduces the number of trainable parameters
from 31.2 x 10^6 in the original nnUNet to 27.7 x 10^6. This is achieved by
replacing the transposed convolutions in the nnUNet with the unpooling
operation. We continued reducing the number of trainable parameters by reducing
the number of stacked convolutional layers from two to one (see Fig. 2). The
resulting model has 17.4 x 10^6 trainable parameters and reduces the
multiply-accumulate operations in a forward pass more than twofold. The mean
Dice Coefficient of this small model dropped from 0.928 to 0.889.
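
The relative model sizes can be checked with a few lines of arithmetic. The exact counts are those of Table 1; the count for the small model is the rounded figure from the text, so its ratio is approximate.

```python
NNUNET_PARAMS = 31_198_176    # Table 1
NNSEGNET_PARAMS = 27_663_648  # Table 1
SMALL_MODEL_PARAMS = 17.4e6   # rounded figure from the text

# Fraction of nnUNet parameters retained by each reduced model.
segnet_ratio = NNSEGNET_PARAMS / NNUNET_PARAMS
small_ratio = SMALL_MODEL_PARAMS / NNUNET_PARAMS

print(f"nnSegNet keeps {segnet_ratio:.1%} of the nnUNet parameters")
print(f"small model keeps about {small_ratio:.0%} of the nnUNet parameters")
```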

Finally, we would like to balance model size against model performance to
improve the applicability of CNNs in real life applications. This again
could very well be tuned by the quantification of uncertainty, specifying the
acceptable level of quality drop in an application specific manner.


Acknowledgments. The authors thank Georgios Nikolis for his support with regard 
to the HPC infrastructure and Foo Wei Ten and Dongsheng Yuan for fruitful discus- 
sions. We are grateful to David Kaul for providing a clinical perspective on brain tumor 
segmentation. We also want to thank all members of the Berlin Institute of Health and 
especially Prof. Eils and Prof. Conrad for their support. 

This work was supported by the German Ministry for Education and Research 
(BMBF, junior research group “Medical Omics”, 01ZZ2001). 




References 


1. Brats2020 validation phase leaderboard. https://www.cbica.upenn.edu/BraTS20/
lboardValidation.html. Accessed 21 Aug 2021

2. Akiba, T., Sano, S., Yanase, T., Ohta, T., Koyama, M.: Optuna: a next-generation
hyperparameter optimization framework. In: Proceedings of the 25th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(2019)

3. Badrinarayanan, V., Kendall, A., Cipolla, R.: SegNet: a deep convolutional
encoder-decoder architecture for image segmentation (2016)

4. Baid, U., et al.: The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor
segmentation and radiogenomic classification (2021)

5. Bakas, S., et al.: Segmentation labels and radiomic features for the pre-operative
scans of the TCGA-GBM collection. Cancer Imaging Arch. Nat. Sci. Data 4,
170117 (2017)

6. Bakas, S., et al.: Segmentation labels and radiomic features for the pre-operative
scans of the TCGA-LGG collection. Cancer Imaging Arch. 286 (2017)

7. Bakas, S., et al.: Advancing the cancer genome atlas glioma MRI collections with
expert segmentation labels and radiomic features. Sci. Data 4(1), 170117 (2017).
https://doi.org/10.1038/sdata.2017.117

8. Bakas, S., et al.: Identifying the best machine learning algorithms for brain tumor
segmentation, progression assessment, and overall survival prediction in the BRATS
challenge (2019)

9. Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter
optimization. In: Shawe-Taylor, J., Zemel, R., Bartlett, P., Pereira, F.,
Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 24.
Curran Associates, Inc. (2011). https://proceedings.neurips.cc/paper/2011/file/
86e8f7ab32cfd12577bc2619bc635690-Paper.pdf

10. Brodbelt, A., Greenberg, D., Winters, T., Williams, M., Vernon, S., Collins,
V.P.: Glioblastoma in England: 2007-2011. Eur. J. Cancer 51(4), 533-542 (2015).
https://doi.org/10.1016/j.ejca.2014.12.014,
https://www.sciencedirect.com/science/article/pii/S0959804915000039

11. Dice, L.R.: Measures of the amount of ecologic association between species.
Ecology 26(3), 297-302 (1945). https://doi.org/10.2307/1932409,
https://esajournals.onlinelibrary.wiley.com/doi/abs/10.2307/1932409

12. Isensee, F., Jaeger, P.F., Full, P.M., Vollmuth, P., Maier-Hein, K.H.: nnU-Net for
brain tumor segmentation (2020)

13. Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H.: nnU-Net:
a self-configuring method for deep learning-based biomedical image segmentation.
Nat. Methods 18(2), 203-211 (2021). https://doi.org/10.1038/s41592-020-01008-z

14. Menze, B.H., et al.: The multimodal brain tumor image segmentation benchmark
(BRATS). IEEE Trans. Med. Imaging 34(10), 1993-2024 (2015). https://doi.org/
10.1109/TMI.2014.2377694

15. Pompeiu, D.: Sur la continuité des fonctions de variables complexes. In: Annales
de la Faculté des sciences de Toulouse: Mathématiques, vol. 7, pp. 265-315 (1905)

16. Rong, Y., Durden, D.L., Van Meir, E.G., Brat, D.J.: ‘Pseudopalisading’ necrosis in
glioblastoma: a familiar morphologic feature that links vascular pathology, hypoxia,
and angiogenesis. J. Neuropathol. Exp. Neurol. 65(6), 529-539 (2006).
https://doi.org/10.1097/00005072-200606000-00001




17. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for
biomedical image segmentation (2015)

18. Weller, M., et al.: EANO guidelines on the diagnosis and treatment of diffuse
gliomas of adulthood. Nat. Rev. Clin. Oncol. 18(3), 170-186 (2021).
https://doi.org/10.1038/s41571-020-00447-z

19. Wolburg, H., Noell, S., Fallier-Becker, P., Mack, A.F., Wolburg-Buchholz, K.:
The disturbed blood-brain barrier in human glioblastoma. Mol. Asp. Med.
33(5), 579-589 (2012). https://doi.org/10.1016/j.mam.2012.02.003,
https://www.sciencedirect.com/science/article/pii/S0098299712000180.
Water Channel Proteins (Aquaporins and Relatives)

20. Yang, R., Yu, Y.: Artificial convolutional neural network in object detection
and semantic segmentation for medical imaging analysis. Front. Oncol. 11,
573 (2021). https://doi.org/10.3389/fonc.2021.638182,
https://www.frontiersin.org/article/10.3389/fonc.2021.638182



Residual 3D U-Net with Localization 
for Brain Tumor Segmentation 


Marc Demoustier¹, Ines Khemir¹, Quoc Duong Nguyen¹(✉),
Lucien Martin-Gaffé¹, and Nicolas Boutry²


¹ EPITA Majeure Santé, 94270 Le Kremlin-Bicêtre, France
quoc-duong.nguyen@epita.fr
² EPITA Research and Development Laboratory (LRDE),
94270 Le Kremlin-Bicêtre, France


Abstract. Gliomas are brain tumors originating from the neuronal sup- 
port tissue called glia, which can be benign or malignant. They are consid- 
ered rare tumors, whose prognosis, which is highly fluctuating, is primarily 
related to several factors, including localization, size, degree of extension 
and certain immune factors. We propose an approach using a Residual 3D 
U-Net to segment these tumors with localization, a technique for centering 
and reducing the size of input images to make more accurate and faster 
predictions. We incorporated different training and post-processing tech- 
niques such as cross-validation and minimum pixel threshold. 


Keywords: Brain tumor segmentation · Deep learning · Convolutional
neural networks · Residual 3D U-Net


1 Introduction 


Gliomas or glial tumors are all brain tumors, benign or malignant, arising from 
the neuronal support tissue or glia. They are rare tumors, whose prognosis, which 
is extremely variable, is mainly related to several factors, including location, size, 
degree of extension and certain immune factors. 

The average survival time is from 12 to 18 months. Brain tumor diagnosis 
and segmentation are difficult, particularly using manual segmentation. 

In addition, medical image annotation experts have to manually annotate 
tumor segmentation, which is time consuming and difficult. Automatic segmen- 
tation of tumors allows for better diagnosis and treatment planning. 

Nowadays, deep learning represents the most effective technology for many 
tasks such as segmentation, tracking and classification in medical image analysis. 
Many studies for brain tumor segmentation use deep learning techniques,
especially convolutional neural networks (CNNs). Recent entries in the Brain
Tumor Segmentation (BraTS) challenge are mostly based on these convolutional
neural networks, specifically on the U-Net architecture [19] or similar,
using an encoder and a decoder with skip-connections. They have shown very 
convincing performance in previous iterations of the challenge [12]. 

The BraTS challenge provides the largest fully annotated, openly accessible
database for model development and is the primary competition for objective
comparison of segmentation methods [2-5,17]. The BraTS 2021 dataset includes
1251 training cases and 219 validation cases. Reference annotations for the
validation set are not provided to participants. Instead, participants can utilize
the online evaluation platform to evaluate their models and compare their results
with other teams on the online leaderboard. In parallel to the segmentation task,
the BraTS 2021 competition includes the task of predicting the MGMT promoter
methylation status from mpMRI scans. In this work, we only take part in
the segmentation task.

To segment these tumors, the BraTS dataset contains 5 images in NIfTI
format for each patient. These images come from MRI (Magnetic Resonance
Imaging); each of the first four images corresponds to a different MRI
modality. These modalities are named T1, T1ce, T2 and FLAIR. The last
image corresponds to the ground truth, i.e. the tumor and its different regions.
The pixel values of this image are:


— 4 for the GD-enhancing tumor 

— 2 for the peritumoral edematous/invaded tissue 
— 1 for the necrotic tumor core 

— 0 for everything else 


Using these pixel values, we can find the different tumor regions: 


— Whole Tumor (WT): 1, 2, 4 
— Tumor Core (TC): 1 and 4 
— Enhanced Tumor (ET): 4 


Fig. 1. Modalities and labels 


In this paper, we use a residual 3D U-Net using localization with cross- 
validation for each region (WT, TC, ET) (Fig. 1). 


2 Methods


The implementation used PyTorch. As a result, we describe the models with 
PyTorch keywords and methods. 


2.1 Pre-processing 


Images given in the BraTS dataset are in 240 (Width) x 240 (Height) x 155
(Number of slices) x 4 (FLAIR, T1, T1ce, T2), in the NIfTI format.

The goal in our approach was to keep the images as close as possible to the 
original data despite the limitations of GPU memory, that is, without too much 
pre-processing on the input images. 

We chose to crop the images to 192 x 192 x 155 to remove the empty borders 
of the images, then added 5 empty slices to obtain a multiple of 8 on every 
dimension (except for the channels). 

As a result, the images given as input of the model are left with as little 
modification as possible. 
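
The shape arithmetic of this step can be sketched as follows. Whether the in-plane crop is centered is an assumption (the text only gives the target size), and both helper names are hypothetical.

```python
def center_crop_bounds(size, target):
    """Start and stop index of a centred crop of length `target` (centring is assumed)."""
    start = (size - target) // 2
    return start, start + target


def pad_to_multiple(size, multiple=8):
    """Number of empty slices to append so that the length is a multiple of `multiple`."""
    return (-size) % multiple


# In-plane: 240 -> 192 by removing the empty borders.
x_start, x_stop = center_crop_bounds(240, 192)
# Through-plane: 155 slices padded up to the next multiple of 8.
extra_slices = pad_to_multiple(155)
padded_depth = 155 + extra_slices
```

A multiple of 8 on every spatial axis means the volume can be halved three times without rounding.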


2.2 Residual 3D U-Net 


The model we are using is a Residual 3D U-Net, based on Superhuman Accuracy 
on the SNEMI3D Connectomics Challenge [16]. Residual U-Nets have already 
been used for biomedical applications [18,20]. Our model is a variant of the 
U-Net [19] in 3D [7]. 

The architecture inherits the main elements from U-Net: a contracting path 
with convolutions and downsampling, an expansive path with convolutions and 
upsampling, and skip connections from the contracting path to the expansive 
path. 

Our model differs from the 3D U-Net in several aspects, such as the use
of same convolution instead of valid convolution. We also added a residual skip
connection to each convolution block, which helps to mitigate the vanishing
gradient problem and to preserve information captured in the initial layers. As
we are limited by VRAM, we have to use a small batch size. We used Group
Normalization as it performs better than Batch Normalization with a small batch
size, improves the ability of the network to generalize and allows the model to
converge rapidly.

In a residual block, the first two convolutions are preceded by group nor- 
malization and followed by the ReLU activation function. After the last con- 
volution layer, there is a concatenation of the residual connection, and then 
activation is called to include the residual information. 

On the contracting path of the U-Net, we use what we call an encoding
residual block (ERB), which contains a MaxPool3d with a kernel size of 2 and a
residual block.


Fig. 3. Residual 3D U-Net blocks


On the expansive path of the U-Net, we use a decoding residual block 
(DRB), which contains a ConvTranspose3d layer with a kernel size of 3 and a 
scaling factor of size 2 to revert to the size of the encoding residual data from 
the same level skip connection. After the concatenation of the skip connection, 
the residual block is added. 

At the end of this network, a 1x1 convolution is used to reduce the number 
of output channels to the number of labels. The number of labels will be 3 for a 
multi-class prediction and 1 in the case of a single-class prediction. 

We trained 3 separate single-class prediction models, one for each region
(WT, TC and ET), each taking 4 channels as input: FLAIR, T1, T1ce and T2.

As the three models predict one label, a sigmoid has been used as final 
activation function. 

The architecture of this network is built to recover the original shape of the
data by using padding on heights and widths of the images (see Fig. 2).

As the MRI images are quite large and our GPUs have only 16 GB of VRAM,
it was necessary to use 24 filters on the first layer of the network to avoid
saturating the GPU memory. However, we are able to run the model with more
filters using localization, which we discuss in the next section (Fig. 3).


2.3 Localization 


Training the models on TC and ET did not give great results. These regions 
are particularly small and the models could not refine the predictions correctly. 
That is because the “base” model only uses 24 filters on the first level of the 
U-Net. 

Increasing the number of filters was not possible because of our VRAM limi- 
tations. We thought about an interpolation to reduce the size of our input images 
but this technique is too destructive. 

In order to make the best of the VRAM limits, we use localization. It
consists in using the predictions on WT to center the input images around the
segmented tumors and to crop the input images around these segmented tumors
(Fig. 4).


Fig. 4. FLAIR image of a brain with localization 


Using this method, we are able to crop the input images into much smaller 
images of size 128 x 128 x 128 instead of 192 x 192 x 160. Whole tumors can fit 
inside these cropped images. As a result, the VRAM usage decreased and we were 
able to increase the number of filters from 24 to 64 on the first convolutional 
layer of the U-Net. 
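
One way to derive such a crop from a WT prediction is sketched below. It centers a 128-voxel window on the bounding box of the predicted tumor along each axis and shifts the window where it would leave the volume. The clamping behaviour, the coordinate-list input and the function names are assumptions for illustration and are not described in the text.

```python
def crop_window(lo, hi, size, extent):
    """Window of length `size` centred on [lo, hi], shifted to stay inside [0, extent].

    Assumes extent >= size.
    """
    center = (lo + hi) // 2
    start = center - size // 2
    start = max(0, min(start, extent - size))
    return start, start + size


def localize(tumor_voxels, volume_shape=(192, 192, 160), crop=128):
    """Per-axis crop windows around the predicted whole-tumor voxels."""
    windows = []
    for axis, extent in enumerate(volume_shape):
        coords = [voxel[axis] for voxel in tumor_voxels]
        windows.append(crop_window(min(coords), max(coords), crop, extent))
    return windows


# Hypothetical coordinates of voxels predicted as WT by the first model.
tumor_voxels = [(30, 100, 70), (60, 140, 90), (45, 120, 150)]
windows = localize(tumor_voxels)
```

Each window is exactly 128 voxels long, so the cropped volume has a fixed shape regardless of where the tumor sits.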

Once we have predicted the area of the tumor, we can run the models on 
WT, TC and ET with 64 filters using the cropped images as input. Note: We 
run the model on WT again with 64 filters to get the best results. 

With a higher number of filters, the model is able to capture more complex 
features such as in the TC and ET regions (Fig. 5). 


Fig. 5. Prediction example (label on the left and prediction on the right)


2.4 Loss Function 


The hybrid loss function is used to train the models for the WT and TC regions. 
This loss combines Dice loss with the standard binary cross-entropy (BCE) loss 
that is generally the default for segmentation models. Summing the two methods 
allows for some diversity in the loss while benefiting from the stability of BCE. 
Both losses have the same coefficients in the hybrid loss. 


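\[ \mathrm{BCE\_loss} = \frac{1}{N}\sum_{n=1}^{N} H(p_n, q_n) = -\frac{1}{N}\sum_{n=1}^{N}\left[ y_n \log(\hat{y}_n) + (1 - y_n)\log(1 - \hat{y}_n) \right] \tag{1} \]

\[ \mathrm{Dice\_loss} = \frac{2\,|X \cap Y|}{|X| + |Y|} \tag{2} \]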


For the ET region, the standard binary cross-entropy (BCE) loss was used 
as it requires more stability in training. 
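
A minimal plain-Python sketch of the hybrid loss with equal coefficients is given below. Two points are assumptions of this example rather than statements from the text: the Dice term is used as one minus the overlap ratio of Eq. (2), which is the usual convention for a quantity that is minimized, and a small `eps` is added for numerical stability.

```python
import math


def bce_loss(targets, probs, eps=1e-7):
    """Mean binary cross-entropy over all voxels, as in Eq. (1)."""
    total = 0.0
    for y, p in zip(targets, probs):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(targets)


def dice_loss(targets, probs, eps=1e-7):
    """One minus the soft Dice overlap (assumed convention)."""
    intersection = sum(y * p for y, p in zip(targets, probs))
    denominator = sum(targets) + sum(probs)
    return 1 - (2 * intersection + eps) / (denominator + eps)


def hybrid_loss(targets, probs):
    """Sum of both terms with equal coefficients."""
    return bce_loss(targets, probs) + dice_loss(targets, probs)


# Hypothetical ground truth and predicted probabilities for four voxels.
targets = [1, 0, 1, 0]
uncertain = hybrid_loss(targets, [0.6, 0.4, 0.6, 0.4])
perfect = hybrid_loss(targets, [1.0, 0.0, 1.0, 0.0])
```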


2.5 Cross Validation


Cross-validation is a method used to train multiple models and improve predic- 
tive performance. The test set is separated beforehand. 

We chose the k-fold method, which consists in dividing the dataset into 5 blocks.
The k-fold method allows us to create 5 different models that each have one
different validation block and the 4 remaining blocks as the training set.

In order to maximize our scores, we combined the predictions of all 5 models
by averaging their outputs (Regression Voting Ensemble) and then binarizing
the result. We also tried a majority vote (Classification Voting Ensemble) after
the binarization.
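
The difference between the two ensembling schemes can be seen in a small sketch. It assumes that each model outputs one probability per voxel and that binarization uses a threshold of 0.5; neither detail is stated in the text, and the function names are hypothetical.

```python
def regression_vote(prob_maps, threshold=0.5):
    """Average the model outputs per voxel, then binarise the mean."""
    n_models = len(prob_maps)
    return [int(sum(voxel) / n_models > threshold) for voxel in zip(*prob_maps)]


def classification_vote(prob_maps, threshold=0.5):
    """Binarise each model output first, then take the majority per voxel."""
    n_models = len(prob_maps)
    return [
        int(2 * sum(p > threshold for p in voxel) > n_models)
        for voxel in zip(*prob_maps)
    ]


# Hypothetical outputs of the 5 fold models for three voxels.
prob_maps = [
    [0.9, 0.6, 0.1],
    [0.8, 0.6, 0.2],
    [0.7, 0.6, 0.1],
    [0.9, 0.1, 0.9],
    [0.6, 0.1, 0.9],
]
averaged = regression_vote(prob_maps)
majority = classification_vote(prob_maps)
```

For the second voxel, three models are weakly positive and two are strongly negative, so the majority vote keeps the voxel while the averaged output drops it.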


2.6 Post-processing 


The analysis of our results obtained on the validation set of BraTS shows that
our predictions contained a large number of false positives in the TC and ET
regions.

In order to decrease that number, we defined a threshold for the number of 
pixels on an image [12]. Each prediction containing a number of pixels below 
this threshold is considered an empty prediction because we know that a tumor 
does not necessarily contain an enhanced tumor (ET). Several threshold values 
were tested. 
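
The thresholding step amounts to the following sketch, in which a binary prediction is given as a flat list of 0/1 values. The function name and the toy masks are hypothetical.

```python
def suppress_small_prediction(mask, min_voxels):
    """Return an empty mask if fewer than `min_voxels` voxels are predicted."""
    if sum(mask) < min_voxels:
        return [0] * len(mask)
    return list(mask)


# Hypothetical ET predictions with 3 and 6 positive voxels, threshold of 5.
small = suppress_small_prediction([0, 1, 1, 0, 1, 0, 0, 0], min_voxels=5)
large = suppress_small_prediction([1, 1, 1, 0, 1, 1, 1, 0], min_voxels=5)
```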


3 Experiments and Results 


3.1 Implementation Details 


In the BraTS 2021 Segmentation Challenge, the training data is composed of 
1251 multimodal MRI cases. 

The network is implemented with PyTorch. The models were trained on 4
NVIDIA Tesla V100 16 GB GPUs. Each model was trained for 40 to 60 epochs
with a batch size of 4, using the Adam optimizer with learning rates of 0.0001
(BCE-Dice loss), 0.00003 (BCE-Dice loss) and 0.003 (BCE) for WT, TC and ET,
respectively.
The model used to crop the input images and center on the tumor has the 
same learning rate as the model trained for WT and also uses the BCE-Dice 
loss. 

We reduced the learning rate with the callback ReduceLROnPlateau by a 
factor of 0.4, with a patience and a cooldown of 2 epochs. 


3.2 Performance on the Validation Set of BraTS 2021 


The Validation Dataset of BraTS 2021 contains 219 brain MRI cases. For each
brain, the four modalities (T1, T2, T1ce and FLAIR) are used to predict the
multi-class segmentation. Predictions are evaluated using the Dice coefficient,
the Hausdorff distance (Hausdorff95), the sensitivity (True Positive Rate) score
and the specificity (True Negative Rate) score. They are defined as follows:


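\[ \mathrm{Dice} = \frac{2\,TP}{FP + 2\,TP + FN} \tag{3} \]

\[ \mathrm{Hausdorff}(T, P) = \max\left\{ \sup_{t \in T} \inf_{p \in P} d(t, p),\ \sup_{p \in P} \inf_{t \in T} d(t, p) \right\} \tag{4} \]

\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{5} \]

\[ \mathrm{Specificity} = \frac{TN}{TN + FP} \tag{6} \]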


where TP, FP, TN and FN denote respectively the number of true positive, false
positive, true negative and false negative voxels.

The Hausdorff distance measures the distance between the predicted regions
and the ground truth regions. t and p denote respectively the pixels in the
ground truth regions T and the predicted regions P, and d(t,p) is the function
that computes the distance between the points t and p. The results are reported
in Tables 1 and 2.
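
The four metrics can be computed directly from their definitions. The sketch below is illustrative only: the helper names are hypothetical, masks are assumed to be flat 0/1 lists, regions are assumed to be non-empty, and the Hausdorff function implements the maximum form written above rather than the 95th-percentile variant (Hausdorff95) reported by the challenge.

```python
import math


def confusion_counts(truth, pred):
    """Count TP, FP, TN and FN voxels for two flat 0/1 masks."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def dice(tp, fp, fn):
    return 2 * tp / (fp + 2 * tp + fn)


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def hausdorff(T, P):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    t_to_p = max(min(math.dist(t, p) for p in P) for t in T)
    p_to_t = max(min(math.dist(t, p) for t in T) for p in P)
    return max(t_to_p, p_to_t)


tp, fp, tn, fn = confusion_counts([1, 1, 0, 0], [1, 0, 1, 0])
example_dice = dice(tp, fp, fn)
example_hd = hausdorff([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (3, 0, 0)])
```

In the toy example each of TP, FP, TN and FN equals 1, so the Dice coefficient is 0.5, and the Hausdorff distance is 2 because the predicted point (3, 0, 0) lies 2 units from its nearest ground truth point.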


Table 1. Performance comparison using the Dice coefficient on the BraTS 2021
validation set using the online tool


Methods                                            | WT (%) | TC (%) | ET (%)
Baseline (BCE loss)                                | 90.98  | 80.24  | 74.07
Classification Voting Ensemble (a)                 | 91.42  | 80.96  | 77.07
Regression Voting Ensemble (RVE) (a)               | 91.45  | 80.98  | 77.58
BCE-Dice loss (WT & TC) (b)                        | 91.34  | 82.71  | 77.58
RVE (a) + Localization (d)                         | 91.64  | 82.71  | 78.26
RVE (a) + ET threshold 100 (c)                     | 91.45  | 80.98  | 78.91
RVE + Localization (d) + ET threshold 400 (c)      | 91.64  | 82.71  | 78.71
RVE + Localization (d) + ET threshold 600 (c)      | 91.64  | 82.71  | 80.22


(a) Cross-validation evaluation method (see Sect. 2.5)
(b) BCE-Dice loss function (see Sect. 2.4)
(c) Post-processing using thresholding (see Sect. 2.6)
(d) Second network with reframing around the WT (see Sect. 2.3)


Table 2. Submission result on validation set


Tumor region | WT (%) | TC (%) | ET (%)
Dice         | 91.64  | 82.71  | 80.22
Hausdorff95  | 4.35   | 12.50  | 25.13
Sensitivity  | 93.61  | 85.74  | 79.14
Specificity  | 99.90  | 99.95  | 99.97


3.3 Performance on the Test Set of BraTS 2021


            | Dice_WT (%) | Dice_TC (%) | Dice_ET (%) | HD95_WT | HD95_TC | HD95_ET
Mean        | 89.21       | 81.30       | 80.64       | 11.81   | 26.57   | 32.18
StdDev      | 16.87       | 28.79       | 26.52       | 47.15   | 86.49   | 98.98
Median      | 94.42       | 93.63       | 90.86       | 2.24    | 2.0     | 1.41
25 quantile | 89.92       | 86.17       | 80.48       | 1.41    | 1.0     | 1.0
75 quantile | 96.72       | 96.54       | 95.03       | 5.10    | 4.97    | 3.0


4 Conclusion 


In this paper, we propose a segmentation method that uses the Residual 3D U-Net
as the backbone of the network and takes the four modalities as input on the
area where the tumor has been predicted. The localization method lets us make
the most of the limited VRAM by cropping and centering on the whole tumor
without any performance loss. The evaluation of our method on the BraTS 2021
test set gives Dice scores of 89.21, 81.30 and 80.64 for the whole tumor, the
tumor core and the enhancing tumor, respectively.


Abstract. The goal of optimal mass transportation (OMT) is to transform
any irregular 3D object (e.g., a brain image) into a cube without
creating significant distortion, which is utilized to preprocess irregular 
brain samples to facilitate the tensor form of the input format of the U- 
net algorithm. The BraTS 2021 database newly provides a challenging 
platform for the detection and segmentation of brain tumors, namely, 
the whole tumor (WT), the tumor core (TC) and the enhanced tumor 
(ET), by AI techniques. We propose a two-phase OMT algorithm with 
density estimates for 3D brain tumor segmentation. In the first phase, we 
construct a volume-mass-preserving OMT via the density determined by 
the FLAIR grayscale of the scanned modality for the U-net and predict 
the possible tumor regions. Then, in the second phase, we increase the 
density on the region of interest and construct a new OMT to enlarge 
the target region of tumors for the U-net so that the U-net has a bet- 
ter chance to learn how to mark the correct segmentation labels. The 
application of this preprocessing OMT technique is a new and trending 
method for CNN training and validation. 


Keywords: Optimal mass transportation - Two-phase OMT - 
Volume-measure-preserving map - Irregular 3D image 


1 Introduction 


In recent years, the MSD2018 [1,2] and BraTS2020 [3-5] databases have provided
a challenging platform for brain tumor segmentation by AI techniques and
attracted enormous attention and interest from researchers in this field.
Furthermore, very recently, BraTS2021 [3,5-8] was jointly organized by the RSNA,
the ASNR and the MICCAI society. It provides 1251 training and 219 validation
brain samples acquired by multiparametric magnetic resonance imaging (mpMRI)
with four scanned modalities, namely, fluid-attenuated inversion recovery
(FLAIR), T1-weighted (T1), T1-weighted contrast-enhanced (T1CE) and
T2-weighted (T2), and focuses on the evaluation of state-of-the-art methods for
the task of brain tumor segmentation on the whole tumor (WT labeled by {2,1,4}),
the tumor core (TC labeled by {1,4}) and the enhanced tumor (ET labeled by {4}).
To address this task, convolutional neural network (CNN) structures with two
layers [9] and eight layers [10] were proposed and made good progress in brain
tumor segmentation. Then, a more sophisticated multiple CNN architecture,
called the U-net model, was first developed in [11] and improved in [12] by
assembling two full CNNs and a U-net. The merits of applying the U-net model
to the challenge of MSD2018 were first demonstrated by [13].

The input data are one of the key components of the CNN. Experience
has shown that adding a large amount of training data and expanding the size
of models up to trillions of parameters can effectively provide excellent
prediction performance. Because of the limitation of Moore’s law, the
computation for such a very large model can become extremely expensive and
inefficient. For this reason, preprocessing for the effective representation of
a large amount of input data becomes crucial. Given an irregular 3D brain image
from an MRI, which is generally composed of 1.5 million vertices, randomly
selecting several cubes (e.g., 16 cube filters in [14]) that cover the irregular
brain image seamlessly is a natural way to fit the tensor input format of the
U-net system. An elegant two-stage optimal mass transportation (2SOMT) method
newly proposed in [15] is designed to first transform an irregular brain image
to a unit ball and then to a 128 x 128 x 128 cube with minimal distortion and
small conversion loss. This strategy can greatly reduce the size of the input
data, so there are more opportunities to expand various types of training data
and effectively use the existing U-net algorithm to improve the expected
accuracy of prediction. However, 2SOMT did not make full use of the density
information in the brain image.

In this paper, we propose a two-phase OMT algorithm for the U-net to improve
the effectiveness of tumor segmentation. First, based on the projected gradient
method, we develop an OMT algorithm that maps an irregular 3D brain
image to a cube directly and has guaranteed sublinear convergence. The
characteristic features of the OMT map are that it preserves the local mass and
minimizes the distortion. Using these features, in the first phase, we construct
a volume-mass-preserving OMT by FLAIR grayscales for the U-net and predict
the possible region of tumors. In the second phase, we increase the density of
the regions of interest with fine meshes in the brain image and construct
a new OMT to enlarge the target region for the U-net, so that the U-net learns
to mark the segmentation labels as if it were viewing the region through a
magnifying glass. Applying this OMT preprocessing technique is an innovative
and streamlined approach to CNN training and prediction.


2 Method


2.1 Discrete OMT as a Preprocessing for the U-Net 


Let M be a simplicial 3-complex with a genus-zero boundary that describes 
an irregular 3D brain image. M is composed of sets of vertices V(M), edges 
E(M), faces F(M) and tetrahedrons T(M). A discrete OMT problem is to find 
a bijective function with minimal distortion that maps M to a canonical simple 
domain such as a ball, an ellipsoid, a cube or a cuboid. Since a tensor form is 
a necessary input format for the U-net algorithm, in this paper, we propose an 
OMT algorithm to map M to a cube while minimizing the transport cost by 
the projected gradient method. Without loss of generality, in this paper, each 
simplicial 3-complex M is centralized so that the center of mass is located at 
the origin and the mass of M is normalized to one. C is denoted as a unit cube 
with a constant density of one. 

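Let $\rho$ be a density map on V(M). The piecewise linear density function of
$\rho$ on T(M) and the volume measure are respectively defined by

$$\rho(\tau) = \frac{1}{4}\sum_{i=1}^{4}\rho(v_i), \quad \tau \in T(M),\ v_i \in V(\tau), \qquad (1a)$$

$$m_\rho(v) = \frac{1}{4}\sum_{v \subset \tau}\rho(\tau)\,|\tau|, \quad \tau \in T(M),\ v \in V(M), \qquad (1b)$$

where $|\tau|$ is the volume of $\tau$. Denote

$$\mathcal{F}_\rho = \{ f: M \to C \mid \rho(\tau)|\tau| = |f(\tau)|,\ \forall \tau \in T(M) \} \qquad (2)$$

as the set of all volume-measure-preserving piecewise linear maps from M to C, in
which the bijective maps between $\tau$ and $f(\tau)$ are determined by the
barycentric coordinates on $\tau$. The discrete OMT problem on M with respect to
$\|\cdot\|_2$ is to find an $f \in \mathcal{F}_\rho$ that solves the optimal problem

$$f_\rho^* = \operatorname*{argmin}_{f \in \mathcal{F}_\rho} c(f), \quad \text{with } c(f) = \sum_{v \in V(M)} \|v - f(v)\|_2^2\, m_\rho(v). \qquad (3)$$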
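
To make the volume measure and the transport cost concrete, the following sketch evaluates them on a tiny tetrahedral mesh. It is only an illustration under stated assumptions: the function names are hypothetical, each tetrahedron density is taken as the average of its four vertex densities, each vertex receives one quarter of the mass of every tetrahedron containing it, and the cost uses the squared 2-norm.

```python
def tet_volume(a, b, c, d):
    """Volume of the tetrahedron (a, b, c, d): |det[b-a, c-a, d-a]| / 6."""
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    det = (u[0] * (v[1] * w[2] - v[2] * w[1])
           - u[1] * (v[0] * w[2] - v[2] * w[0])
           + u[2] * (v[0] * w[1] - v[1] * w[0]))
    return abs(det) / 6.0


def vertex_measure(vertices, tets, rho):
    """Vertex masses: a quarter of rho(tau) * |tau| from each incident tetrahedron."""
    m = [0.0] * len(vertices)
    for tet in tets:
        rho_tau = sum(rho[i] for i in tet) / 4.0          # density on the tetrahedron
        vol = tet_volume(*(vertices[i] for i in tet))
        for i in tet:
            m[i] += rho_tau * vol / 4.0                   # share given to each vertex
    return m


def transport_cost(vertices, images, m):
    """Sum over vertices of squared displacement times vertex mass."""
    return sum(
        sum((x - y) ** 2 for x, y in zip(v, fv)) * mv
        for v, fv, mv in zip(vertices, images, m)
    )


# One unit tetrahedron with constant density one (toy data).
vertices = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
tets = [(0, 1, 2, 3)]
masses = vertex_measure(vertices, tets, rho=[1.0] * 4)
shifted = [(x + 1, y, z) for x, y, z in vertices]
cost = transport_cost(vertices, shifted, masses)
```

For this single tetrahedron of volume 1/6, each vertex receives mass 1/24, and translating every vertex by one unit gives a cost of 4 x 1/24 = 1/6.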


Suppose $g_\rho^* = \operatorname*{argmin}_{g} \sum_{v \in V(\partial M)} \big\{ \|v - g(v)\|_2^2 \sum_{v \subset \alpha} \rho(\alpha)|\alpha| \big\}$ over $g: \partial M \to \partial C$
with $\rho(\alpha)|\alpha| = |g(\alpha)|$ for all $\alpha \in F(\partial M)$, which is computed by
area-measure-preserving OMT [16]. We now propose a volume-measure-preserving
OMT algorithm for solving the OMT map $f_\rho^*$ from M to C for (3) by the
projected gradient method combined with the volume stretch energy minimization
(VSEM) algorithm [16] with $g_\rho^*$ fixed on the boundary of M.

We first compute a volume-measure-preserving map $f^{(0)}$ by VSEM [17] with
$\mathbf{f}_B^{(0)} = \mathbf{g}_\rho^*$ as a fixed boundary map, where $\mathbf{g}_\rho^*$ is the inducing vector of $g_\rho^*$. For
k = 0, 1, ..., we update the vector by

$$\mathbf{f}^{(k)} \leftarrow \mathbf{f}^{(k)} - \eta_k \nabla c(\mathbf{f}^{(k)}), \qquad (4)$$

where $\eta_k > 0$ is the step length determined by the line-search procedure. Then, we
project $\mathbf{f}^{(k)}$ onto the convex domain $\mathcal{F}_\rho$ of (2). We fix $\mathbf{f}_B^{(k)} = \mathbf{g}_\rho^*$ as the boundary
map on $\partial M$ and perform the VSEM [17] by updating the interior map $\mathbf{f}_I$ by

$$L_{I,I}\,\mathbf{f}_I = -L_{I,B}\,\mathbf{f}_B, \quad I = \{1, \dots, n\} \setminus B, \quad n = \#V(M), \qquad (5)$$

using the modified volume-weighted Laplacian matrix $L \equiv L(f)$ defined in
[17] at each iteration until the volume-weighted stretch energy
$E(f) = \frac{1}{2}\sum_{t=1}^{3} (\mathbf{f}^t)^\top L\, \mathbf{f}^t$ converges, where $\mathbf{f} = [\mathbf{f}^1, \mathbf{f}^2, \mathbf{f}^3]$.

Similar to the standard convergence analysis of the projected gradient 
method (see, e.g., [16,18]), the OMT algorithm can be proven to be convergent 
with a sublinear rate of O(1/k). 
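
The structure of a projected gradient iteration (a gradient step followed by a projection onto the feasible set) can be seen on a toy problem. The sketch below is not the OMT algorithm: it minimizes a simple quadratic over the unit box, where the projection is a coordinate-wise clip, whereas the method above projects onto the volume-measure-preserving maps by VSEM. The function name, the fixed step size and the data are assumptions for illustration.

```python
def projected_gradient(a, x0, step=0.25, iters=60):
    """Minimize ||x - a||^2 over the box [0, 1]^n by projected gradient descent."""
    x = list(x0)
    for _ in range(iters):
        grad = [2.0 * (xi - ai) for xi, ai in zip(x, a)]        # gradient of the cost
        x = [xi - step * gi for xi, gi in zip(x, grad)]          # gradient step
        x = [min(1.0, max(0.0, xi)) for xi in x]                 # projection onto the box
    return x


target = [1.5, -0.3, 0.4]
solution = projected_gradient(target, x0=[0.5, 0.5, 0.5])
```

The iterate converges to the point of the box closest to the target, with the first two coordinates clipped to 1 and 0 and the third equal to 0.4.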


2.2 Two-Phase OMT for Training and Validation 


A brain image scanned by mpMRI typically provides four modalities, FLAIR,
T1, T1CE and T2, with grayscale values ranging from 0 to 65535 on
each voxel of four 240 x 240 x 155 cuboids, denoted by $\{I_s\}_{s=1}^{4}$. For a training
brain image, let $\mathcal{L}$ denote the 240 x 240 x 155 cuboid labeled by WT = {2, 1, 4},
TC = {1, 4}, ET = {4} and {0} for others. In practice, the grayscale values on $I_s$
can be normalized to [0, 1], denoted by $\tilde{I}_s$, by (grayscale value − mean)/variance
with a suitable shift and scaling.

An actual brain is contained in $I_s$ and accounts for approximately 12%–20% of
the voxels. Suppose $M \subset I_s$ is a simplicial 3-complex with a genus-zero boundary
composed of tetrahedral meshes representing a brain image. The normalized
grayscale on the voxel $\tilde{I}_s(i,j,k)$ can be used to define the density map on
V(M) by

$$\rho_s(v) = \exp(\tilde{I}_s(i,j,k)), \quad v \in \tilde{I}_s(i,j,k). \qquad (6)$$
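
A minimal sketch of the density construction in (6) is given below. The text does not specify the "suitable shift and scaling", so a min-max rescaling to [0, 1] is assumed here; the function name and the sample grayscale values are hypothetical.

```python
import math
import statistics


def density_from_grayscale(values):
    """Normalize grayscale values to [0, 1] and return exp(normalized value)."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    # (grayscale value - mean) / variance, as stated in the text.
    z = [(g - mean) / var for g in values]
    # Assumed shift and scaling: min-max rescaling to the interval [0, 1].
    lo, hi = min(z), max(z)
    normalized = [(x - lo) / (hi - lo) for x in z]
    return [math.exp(x) for x in normalized]


rho = density_from_grayscale([0.0, 10.0, 20.0])
```

Under this assumption the density ranges from 1 on the darkest voxel to e on the brightest one.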


Two-Phase OMT and U-Net Algorithm for Training. We now propose a 
two-phase OMT algorithm with estimates of density functions to construct the 
effective input tensor for the U-net algorithm. 


Phase I. We construct a density function on V(M) by $\rho_1(v) = \exp(\tilde{I}_1(i,j,k))$
for $v \in \tilde{I}_1(i,j,k)$, as in (6), where $\tilde{I}_1$ records the normalized FLAIR grayscales. In
general, the FLAIR modality typically reflects the distribution of WT = {2, 1, 4}.
We compute the OMT map $f_{\rho_1}^*$ as in (3) from M to a $128^3$ cube $\mathcal{N}_0$. Then, we
compute four $128^3$ cubes $\{\mathcal{N}_{0,s}\}_{s=1}^{4}$ and one $128^3$ cube $\mathcal{L}_0$ corresponding to the
grayscales of $M \subset I_s$, s = 1, ..., 4, and the labels in $M \subset \mathcal{L}$, respectively, via the
OMT map $f_{\rho_1}^*$. Then, we call the U-net with the input data of 4 x $128^3$ $\{\mathcal{N}_{0,s}\}_{s=1}^{4}$
and one $128^3$ $\mathcal{L}_0$ to train Net 0.

Net 0 is designed to detect the possible tumor region of WT and then used 
to construct a new density function for enlarging the tumor region for phase II. 


Phase II. For a given training brain image, we expand the tumor region of WT
labeled by 1 = {2, 1, 4} outward by 5 voxels, say $\mathcal{T} \subset M$, and construct a new
density function $\tilde{\rho}$ with fine meshes by

$$\tilde{\rho}(v) = \begin{cases} \ldots, & \text{if } v \in I(i,j,k) \subset \mathcal{T}, \\ \ldots, & \text{otherwise.} \end{cases} \qquad (7)$$

We compute the OMT map $f_{\tilde{\rho}}^*$ from M to a $128^3$ cube $\mathcal{N}_1$ and four $128^3$ cubes
$\{\mathcal{N}_{1,s}\}_{s=1}^{4}$ corresponding to the grayscale values of $M \subset I_s$ via the OMT map $f_{\tilde{\rho}}^*$.
Then, we construct three $128^3$ cubes $\mathcal{L}_1$, $\mathcal{L}_2$ and $\mathcal{L}_3$ associated with the labels
of {0 = {0}, 1 = {2,1,4} = WT}, {0 = {0,2}, 1 = {1,4} = TC}, and {0 =
{0,2,1}, 1 = {4} = ET}, respectively. Then, we call the U-net with the input
data of 4 x $128^3$ $\{\mathcal{N}_{1,s}\}_{s=1}^{4}$ and $\mathcal{L}_j$, respectively, for j = 1, 2, 3, to train three
nets, namely, Net 1, Net 2, and Net 3.
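
Two ingredients of phase II, the binary targets for the three nets and the outward expansion of the WT region, can be sketched as follows. The names are hypothetical, and the expansion uses a cubic neighbourhood (all voxels within 5 voxels along every axis), which is an assumption since the text does not state which neighbourhood is used.

```python
from itertools import product

# Label sets of the three regions, as given in the text.
REGIONS = {"WT": {1, 2, 4}, "TC": {1, 4}, "ET": {4}}


def binary_target(labels, region):
    """Map BraTS labels to the 0/1 target of one region."""
    return [1 if lab in REGIONS[region] else 0 for lab in labels]


def expand_region(voxels, margin, shape):
    """Grow a set of (i, j, k) voxels by `margin` voxels along every axis."""
    offsets = range(-margin, margin + 1)
    grown = set()
    for (i, j, k) in voxels:
        for di, dj, dk in product(offsets, repeat=3):
            p = (i + di, j + dj, k + dk)
            # Keep only voxels that stay inside the image cuboid.
            if all(0 <= c < n for c, n in zip(p, shape)):
                grown.add(p)
    return grown


labels = [0, 1, 2, 4]
targets = {name: binary_target(labels, name) for name in REGIONS}
interior = expand_region({(10, 10, 10)}, margin=5, shape=(240, 240, 155))
corner = expand_region({(0, 0, 0)}, margin=5, shape=(240, 240, 155))
```

A single interior voxel grows into an 11 x 11 x 11 block, while a voxel at the corner of the cuboid grows only into the part of that block lying inside the image.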


Net 0, Net 1 - Net 3 for Validation. Once we have computed Net 0 and
Net 1 - Net 3 by phase I and phase II, respectively, we use Net 0 to detect the
possible tumor region of WT = {2,1,4}, expand this region outward by 5
voxels, say $\mathcal{T} \subset M$, and construct a new density function $\tilde{\rho}$ depending on $\mathcal{T}$
with fine meshes as in (7). We compute four $128^3$ cubes $\{\mathcal{N}_{1,s}\}_{s=1}^{4}$ for the grayscale
values of FLAIR, T1, T1CE and T2 via the OMT map $f_{\tilde{\rho}}^*$ and use Net 1, Net 2 and
Net 3 to predict the labels of three $128^3$ cubes $\mathcal{L}_1$, $\mathcal{L}_2$ and $\mathcal{L}_3$.
Validation of a testing brain image:


(i) Net 1 → {0 = {0}, 1 = {2,1,4}} on $\mathcal{L}_1$;
    Net 2 → {0 = {0,2}, 1 = {1,4}} on $\mathcal{L}_2$;
    Net 3 → {0 = {0,2,1}, 1 = {4}} on $\mathcal{L}_3$.

(ii) According to the labels {0,1} on $\mathcal{L}_1$, we mark a $128^3$ cube $\mathcal{L}$ by {0} and
     {2} for labels “0” and “1” on $\mathcal{L}_1$, respectively;
     According to the labels {0,1} on $\mathcal{L}_2$, we mark $\mathcal{L}$ by {1} for label “1” on
     $\mathcal{L}_2$;
     According to the labels {0,1} on $\mathcal{L}_3$, we mark $\mathcal{L}$ by {4} for label “1” on
     $\mathcal{L}_3$.

(iii) Let w denote the center of a voxel $v \subset M$. Voxel v is assigned the label of
      the voxel of $\mathcal{L}$ that contains $f_{\tilde{\rho}}^*(w)$.
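
Step (ii) can be sketched as a simple overwrite in order of decreasing region size. This is an illustration, not the authors' code: the function name and the flat-list layout are hypothetical, and it assumes that voxels belonging only to the WT prediction receive label 2, in line with WT = {2, 1, 4}, TC = {1, 4} and ET = {4}.

```python
def merge_predictions(wt, tc, et):
    """Combine three binary predictions into one label list with values 0, 1, 2, 4."""
    merged = []
    for w, t, e in zip(wt, tc, et):
        label = 0
        if w:
            label = 2   # inside the whole tumor (assumed label for WT-only voxels)
        if t:
            label = 1   # inside the tumor core
        if e:
            label = 4   # inside the enhancing tumor
        merged.append(label)
    return merged


merged = merge_predictions(
    wt=[0, 1, 1, 1],
    tc=[0, 0, 1, 1],
    et=[0, 0, 0, 1],
)
```

Later regions overwrite earlier ones, so the nested structure ET inside TC inside WT is reproduced whenever the three nets agree.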


The flow chart of the two-phase OMT and the U-net algorithm for training 
and validation is summarized in Fig. 1. 


3 Results 


As described in the previous sections, the OMT map transforms irregular 3D
brain images into cubes while preserving the local mass and minimizing the
deformation, which enables the U-net algorithm to train an effective prediction
function for brain tumor detection and segmentation.

[Figure: flow chart. (a) 1251 OMT maps with density function $\rho_1$ by the FLAIR
grayscales, and the Net 0 computed by the U-net algorithm. (b) Use Net 0 to do
object detection on the testing data and get a possible WT region. Extend the
WT region outward by 5 voxels and test it with Net 1 - Net 3 to get the
segmentation labels.]


Fig. 1. (a) Phase I: construct the Net 0 to predict the possible WT region; (b) Phase
II: construct Net 1 - Net 3 to evaluate the possible labels on the original brain image.


Conversion Loss Between Cubes and Original Brains. Let A and B denote
the label sets of the ground truth labels and the conversion labels by $f_{\tilde{\rho}}^*$ on
$M \subset \mathcal{L}$ for the WT (labeled by {2,1,4}), TC (labeled by {1,4}) and ET (labeled
by {4}), respectively. We define the conversion loss by $1 - |A \cap B| / |A|$, where |A|
denotes the cardinal number of A. In Table 1, we report the average conversion
loss between brains and cubes for WT, TC and ET over all 1251 brain images in
the BraTS 2021 Challenge dataset with typical grid sizes of $96^3$ and $128^3$ by the
OMT map $f_{\tilde{\rho}}^*$.

Table 1. Conversion loss between brains and cubes with grid sizes of $96^3$ and $128^3$,
respectively.


OMT map $f_{\tilde{\rho}}^*$           | WT     | TC     | ET
Conversion loss for $96^3$   | 0.43%  | 0.30%  | 0.65%
Conversion loss for $128^3$  | 0.084% | 0.026% | 0.047%
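
The conversion loss can be computed with plain set operations. The sketch below assumes the loss is one minus the fraction of ground truth voxels that keep their label after the round trip through the cube; the function name and the voxel indices are hypothetical.

```python
def conversion_loss(truth_voxels, converted_voxels):
    """Return 1 - |A ∩ B| / |A| for two collections of voxel indices."""
    a = set(truth_voxels)
    b = set(converted_voxels)
    return 1.0 - len(a & b) / len(a)


truth = set(range(1000))        # hypothetical WT voxels in the original brain
converted = set(range(1, 1000)) # one voxel is lost after mapping to the cube and back
loss = conversion_loss(truth, converted)
```

Losing one voxel out of 1000 gives a loss of 0.1%, which is the order of magnitude reported for the $128^3$ grid.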


We see that the deformation of the OMT map $f_{\tilde{\rho}}^*$ from M to the $128^3$ cube does
not produce a considerable accuracy loss: the largest average conversion loss,
that of the WT, is 0.084%. On the other hand, for the $96^3$ cube the largest
average conversion loss, that of the ET, is 0.65%, which is not adequate for
constructing a good prediction function, even though a cube size of $96^3$ would
save considerable computational cost. Therefore, a cube of size $128^3$ is an
excellent choice that not only has a smaller conversion loss between cubes and
the original brains but also matches the input limitation of the U-net algorithm.

Furthermore, in Table 2, we show the average percentages of the WT, TC
and ET in the original brain and in the $128^3$ cube by the OMT map $f_{\tilde{\rho}}^*$ with the
new density function as in (7). The WT accounts for 6.49% of the raw data of
the original brain. However, under the newly constructed density function, the
enhanced histogram equalization of the grayscale and the OMT map $f_{\tilde{\rho}}^*$ in phase II
of Sect. 2.2, the WT is enlarged roughly threefold in the cube, reaching 20.28%.
This indeed helps with detecting various tumors in brains by the U-net algorithm.


Table 2. The average percentages of tumors in the raw data of size 240 x 240 x 155
and cubes of size 128 x 128 x 128 computed by the OMT map $f_{\tilde{\rho}}^*$.


Data type                               | WT     | TC    | ET
Tumor in the raw data (240 x 240 x 155) | 6.49%  | 2.42% | 1.45%
Tumor in the cube (128 x 128 x 128)     | 20.28% | 7.62% | 4.62%


Dice Score of Validation and Testing. As in Sect. 2.2, we train Net 0 and Net
1 - Net 3 by using the U-net algorithm on the 1251 brain samples from the BraTS
2021 Challenge dataset [6]. The BraTS 2021 dataset contains 2000 brain images. An
online evaluation platform for BraTS 2021 was recently opened and provided 219
unlabeled brain image samples for validation. The others are unreleased brain
image samples for testing. The Dice scores of the WT, TC, and ET returned by the
platform for validation and testing, presented in Table 3, are obtained with
Net 0 and Net 1 - Net 3 at 160 epochs of the U-net algorithm.

Table 3. The Dice scores of the WT, TC, and ET in the validation and testing sets
with 160 epochs in the U-net algorithm.


   | Validation | Testing
   |            | Mean   | StdDev | Median | 25 quantile | 75 quantile
WT | 0.9200     | 0.5205 | 0.4508 | 0.9344 | 0           | 0.9368
TC | 0.8523     | 0.4872 | 0.4586 | 0.7070 | 0           | 0.9479
ET | 0.8289     | 0.4722 | 0.4368 | 0.7078 | 0           | 0.9026


Other measurements, namely the sensitivity and specificity for the voxelwise
overlap in the segmented regions and the 95th-percentile Hausdorff distance
(HD95) for the distance between segmentation boundaries, are calculated and
shown in Table 4. The Dice scores for the testing data are unsatisfactory,
probably because our executables did not recognize the orientations of the
testing data. All the training data we use are in LPI voxel order; however, the
testing data are in either RAI or LPI voxel order. As a result, the Dice scores
for the testing data in RAI voxel order are severely degraded by the wrong
orientation. This issue should be remedied in our future release.
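
One possible remedy is to bring every volume to a common voxel order before inference. The sketch below is hypothetical and deliberately simplified: it assumes the two orientation codes differ only in the direction of the first two axes (R versus L, A versus P) while the axis order and the third axis are the same, so reversing the first two axes converts one into the other. A real pipeline should read the orientation from the image header rather than assume it.

```python
def flip_first_two_axes(volume):
    """Reverse the first two axes of a nested-list volume indexed [x][y][z]."""
    # volume[::-1] reverses the x axis; plane[::-1] reverses the y axis.
    return [plane[::-1] for plane in volume[::-1]]


# Tiny 2 x 2 x 1 volume used only for illustration.
volume = [[[1], [2]],
          [[3], [4]]]
flipped = flip_first_two_axes(volume)
restored = flip_first_two_axes(flipped)
```

Applying the flip twice returns the original volume, so the same function can also map predictions back to the orientation of the input.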


Table 4. Sensitivity, specificity and HD95 for the WT, TC, and ET for the validation
and testing sets with 160 epochs in the U-net algorithm.


                      | Sensitivity              | Specificity              | HD95
                      | WT     | TC     | ET     | WT     | TC     | ET     | WT    | TC    | ET
Validation            | 0.9259 | 0.8511 | 0.8431 | 0.9993 | 0.9998 | 0.9997 | 3.800 | 8.210 | 16.33
Testing (Mean)        | 0.5139 | 0.4942 | 0.4732 | 0.5786 | 0.5787 | 0.5788 | 161.0 | 168.9 | 167.9
Testing (StdDev)      | 0.4450 | 0.4603 | 0.4420 | 0.4934 | 0.4935 | 0.4936 | 181.8 | 183.4 | 184.2
Testing (Median)      | 0.8117 | 0.7637 | 0.6607 | 0.9988 | 0.9995 | 0.9995 | 8.106 | 12.57 | 4.241
Testing (25 quantile) | 0      | 0      | 0      | 0      | 0      | 0      | 2.236 | 1.732 | 1.414
Testing (75 quantile) | 0.9239 | 0.9524 | 0.9156 | 0.9998 | 0.9999 | 0.9999 | 374   | 374   | 374


To further understand the specific advantage of the two-phase OMT maps, which
preserve the local mass ratios while minimizing the transport cost and
distortion, we randomly divide the 1251 brain samples into 1000 samples for
training and 251 for validation. The Dice scores of WT, TC, and ET in the
cubes and brains, respectively, for training and validation shown in Table 5 are
computed by Net 0 and Net 1 - Net 3. Without any data augmentation or
postprocessing in this work, the Dice scores in Table 5 are quite satisfactory.

Table 5. The Dice scores of the WT, TC, and ET in brains in the training and
validation with 160 epochs in the U-net algorithm.


Dice score (Brains), 160 epochs | WT     | TC     | ET
Training                        | 0.9614 | 0.9340 | 0.9121
Validation                      | 0.9317 | 0.8896 | 0.8564


4 Discussion 


This work mainly introduces the two-phase OMT technique for 3D brain tumor
detection and segmentation. The OMT technique was first introduced to this
research area by Lin et al. [15]. However, the density function estimates for the
prediction of possible tumor regions were not sufficiently utilized in [15]. In this
paper, we first use FLAIR grayscales to construct a corresponding density
function for the OMT to transform an irregular 3D brain image to a $128^3$ cube with
minimal distortion, which suits the input format of the U-net algorithm and is
used to create a predicting Net 0. Second, we use Net 0 to predict the possible
tumor regions, expand them outward by 5 voxels, and construct an associated step
density function on the brain. Then, we run the U-net with this new density
function to train three nets, Net 1 - Net 3, for label evaluations of the validation
set. Using the OMT map to convert an irregular 3D image to a cube with
minimal transport cost while preserving the local mass ratio is a new approach
in the medical imaging field. Each brain image only needs to be represented by
a single cube, which saves considerable memory for the input data of the U-net.
Augmentation techniques, such as rotating, mirroring, shearing, and cropping, as
well as postprocessing techniques are our next research topics.


Abstract. We apply a cascaded training pipeline for the 3D U-Net to segment each
brain tumor sub-region separately and chronologically. Firstly, the volumetric data 
of four modalities are used to segment the whole tumor in the first round of training. 
Then, our model combines the whole tumor segmentation with the mpMRI images 
to segment the tumor core. Finally, the network uses whole tumor and tumor core 
segmentations to predict enhancing tumor regions. Unlike the standard 3D U-Net, 
we use Group Normalization and Randomized Leaky Rectified Linear Unit in 
the encoding and decoding blocks. We achieved dice scores on the validation set 
of 88.84, 81.97, and 75.02 for whole tumor, tumor core, and enhancing tumor, 
respectively. 


Keywords: 3D U-Net - Brain tumor segmentation - Medical image segmentation 


1 Introduction 


Glioblastoma is the most common malignant primary brain tumor in humans. The tumor 
has a variety of histological sub-regions, including edema/invasion, active tumor struc- 
tures, necrotic components, and non-enhancing gross abnormalities. Accurate segmenta- 
tion of these intrinsic sub-regions using Magnetic Resonance Imaging (MRI) is critical 
for the potential diagnosis and treatment of this disease. In most clinical centers, the 
segmentation of whole tumor and sub-compartment is still performed manually and 
is considered the standard approach. However, manual segmentation takes time and 
requires skilled experts, hence it is crucial to employ fully automated segmentation tools 
capable of segmenting brain tumor sub-regions. Recently, Deep learning (DL) has been 
widely adopted for medical imaging thanks to its ability to learn complicated repre- 
sentations from raw data without requiring human engineering and domain expertise to 
create feature extractors. Therefore, a considerable number of studies about DL appli- 
cations in brain tumor segmentation have been introduced, demonstrating its success 
in the field. Moreover, the Multimodal Brain Tumor Segmentation Challenge (BraTS)
dataset, which includes multi-institutional multimodal MRI scans of glioblastoma and
lower grade glioma, has attracted many researchers, who have submitted fully automatic
brain tumor segmentation algorithms and achieved notable results.

The U-Net, introduced by Ronneberger et al. [1], was the first high-impact encoder- 
decoder structure that was widely employed for medical image segmentation. It com- 
prises a contracting path and an expanding path, which is similar to the fully convolutional 
network architecture. However, the novelty of U-Net lies in the fact that the up-sampling 
and down-sampling layers are joined via skip connections to connect opposing convolu- 
tion and deconvolution layers. The U-Net’s symmetric structure with skip connections 
was a perfect solution for medical imaging segmentation tasks because it can combine 
low-level and high-level features in medical images to recognize objects that contain 
noise and blurred boundaries. Several variants of U-Net that are capable of perform- 
ing 3D segmentation were later introduced and achieved noteworthy advancement. For 
instance, Çiçek et al. [2] suggested a 3D U-Net by substituting 2D operations in 2D U-Net 
with 3D counterparts, while Milletari et al. [3] built a 3D-variant of U-Net architecture 
called V-net by employing residual blocks. Because these architectures are trained on 
entire images or large image patches rather than small patches, they are influenced by 
data scarcity, which is often handled via data transformations like shifting, rotating, 
scaling, or random deformations. Kayalibay et al. [4] employed a 3D U-Net-like
network architecture for bone and brain tumor segmentation. They combined
different segmentation maps created at various scales to speed up convergence.
However, because this method used wide receptive fields in convolutional layers, it can 
be computationally costly. Isensee et al. [5] proposed a 3D U-Net with modifications 
to the up-sampling pathways, filters number, methods of normalization, and the batch 
size, enabling training with large image patches and capturing spatial data that leads to 
improvements in segmentation performance. A separable 3D U-Net made up of separable
3D convolutions was proposed in a more recent paper by Chen et al. [6]. Using several 3D
U-Net blocks, their S3D-UNet design fully utilizes the 3D volumes. It is also worth
noting that the winning submissions to the BraTS 2019 and BraTS 2020 used U-Net-based
designs as well. While Jiang et al. [7] utilized a two-stage cascaded U-Net, Isensee et al. 
[8] used the nnU-Net architecture that was originally developed as a general-purpose 
U-Net based network for segmentation. 

In this study, we propose a chronological cascaded 3D U-Net network, which con- 
catenates segmentation from the previous round to the next ones. Each training round 
is performed as a normal brain tumor segmentation training process. We also apply a 
customized weighted dice loss function to give weights for different losses on brain 
tumor sub-regions. We achieved dice scores on the validation set of 88.84, 81.97, and 
75.02 for WT, TC, and ET, respectively. 


2 Dataset 


We use the RSNA-ASNR-MICCAI BraTS 2021 challenge training and validation
datasets [9-13], which include 1251 and 219 cases, respectively. Each case contains
four volumetric MRI scans from four different modalities: a) native T1-weighted (T1),
b) post-contrast T1-weighted (T1Gd, Gadolinium), c) T2-weighted (T2), and d) T2
Fluid Attenuated Inversion Recovery (T2-FLAIR) volumes. All BraTS mpMRI scans have
undergone standardized pre-processing routines, such as NIfTI file format conversion,
re-orientation, co-registration, resampling to 1 mm³, and skull-stripping. Annotations
in the datasets were approved by board-certified experts who have been working with
glioma for more than 15 years. Each annotation includes the necrotic tumor core (NCR,
label 1), the peritumoral edematous/invaded tissue (ED, label 2), and Gd-enhancing
tumor (ET, label 4).


[Figure: sample training images; recoverable panel titles: "label channel 1 - TC",
"label channel 2 - ET".]


Fig. 1. Overview of the training images. The first row displays a sample input data in
four channels, corresponding to four modalities. The second row shows three tumor
sub-regions in a sample ground truth: WT, TC, and ET.


2.1 Tumor Distribution 


From the training dataset, we found that almost every slice in an MR image might
contain tumors, even some of the first few slices. Only the last several slices of an
image do not contain any tumor. Therefore, to keep as much tumor information as
possible, instead of reducing the depth of the 3D image to 128 slices as in other
popular methods, we keep all 155 slices.

Also, in order to reduce the computational cost, we want to reduce the height and
width of the 3D MR images to a size that is small enough and consistent across the
dataset while still preserving the tumor regions. To select a good region of interest
for the training images, we analyze the tumor distribution by adding all segmentation
images in the dataset together and visualizing the regions that are likely to include
tumors. The background, with value 0, remains black, whereas regions in which many
tumors occur have high values and appear brighter in the visualization. We examined
the summary images in both 3D and 2D versions to detect regions where tumors are
concentrated.

Fig. 2. A 2D plot of the tumor regions of all training data. The brighter the color is, the more
likely the tumors occur in that region. (Color figure online)


Figure 2 shows that tumors appear symmetrically on both hemispheres among images 
in the training dataset, represented by the bright color. Overall, the areas that are more 
likely to contain tumors are central parts of both hemispheres. In the summary image, 
the hemisphere on the left side has brighter and slightly larger tumor-likely regions than 
the one on the right. It is represented by the larger white area on the left, which has a 
clearer boundary. This indicates that in the training dataset, tumors occur more on the 
left part of the image and their locations are concentrated. 
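
The summary image described in this section amounts to a voxel-wise sum over all segmentation maps. The sketch below is a simplified illustration: the function name and the flat-list layout are hypothetical, and each segmentation is binarized before summation so that every case contributes at most one count per voxel, which is an assumption rather than a detail stated in the text.

```python
def tumor_frequency_map(segmentations):
    """Count, for every voxel, in how many cases it belongs to a tumor."""
    total = [0] * len(segmentations[0])
    for seg in segmentations:
        for idx, label in enumerate(seg):
            if label != 0:          # any non-background label counts as tumor
                total[idx] += 1
    return total


# Three hypothetical cases with five voxels each (labels 0, 1, 2, 4).
cases = [
    [0, 2, 1, 4, 0],
    [0, 0, 2, 2, 0],
    [0, 0, 4, 0, 0],
]
frequency = tumor_frequency_map(cases)
```

Voxels with a high count correspond to the bright regions of the summary image, and voxels with count 0 remain black.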


3 Method 


3.1 Overall Architecture 


We propose a training architecture that consists of three training rounds, each round 
includes a 3D U-Net architecture inspired by [2] and [8] with some minor modifica- 
tions. We call them the Modified 3D U-Net. Our network segments each brain tumor 
sub-regions separately, from the largest region to the smallest, then concatenate the seg- 
mentation of the current sub-region with the input data to feed to the network of the next 
training round. In other words, we train the volumetric images of four modalities and 
segment the WT in the first round of training, then combine the WT segmentation with 
the mpMRI images to segment TC, and finally use WT and TC segmentations to predict 
the ET region. Each network in a training round is fully independent. The network for 
WT was trained, its weights frozen and the output used for the subsequent networks. 
This allows flexible adjustment of hyperparameters for each network corresponding to 
the output tumor sub-regions. The training process is described in detail in Fig. 1 (Fig. 3). 
Fig. 3. Our proposed training pipeline consists of three training rounds, each trained to
segment a specific brain tumor sub-region.
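
The channel bookkeeping of the three rounds can be sketched as follows. This is a minimal illustration rather than the authors' code: `dummy_segmenter` is a hypothetical stand-in for a trained Modified 3D U-Net, the toy volume is much smaller than 155 x 160 x 160, and feeding the third round with the mpMRI channels plus both earlier masks (six channels) is an assumption based on the description above.

```python
import numpy as np

def dummy_segmenter(x):
    """Hypothetical stand-in for a trained network: returns one binary mask (1, D, H, W)."""
    # Thresholding the channel-wise mean replaces the real forward pass.
    return (x.mean(axis=0, keepdims=True) > 0.5).astype(x.dtype)

def cascade(mpmri, seg_wt=dummy_segmenter, seg_tc=dummy_segmenter, seg_et=dummy_segmenter):
    """Run WT -> TC -> ET, concatenating earlier masks to the input channels."""
    wt = seg_wt(mpmri)                                     # round 1: 4 input channels
    tc = seg_tc(np.concatenate([mpmri, wt], axis=0))       # round 2: 4 + 1 input channels
    et = seg_et(np.concatenate([mpmri, wt, tc], axis=0))   # round 3: 4 + 2 input channels (assumed)
    return np.concatenate([wt, tc, et], axis=0)            # stacked as (3, D, H, W)

rng = np.random.default_rng(0)
volume = rng.random((4, 8, 10, 10)).astype(np.float32)    # toy (modalities, D, H, W)
stacked = cascade(volume)
```

Because each round is a separate callable, a different architecture or set of hyperparameters could be swapped in per sub-region without touching the others.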


3.2 The Modified 3D U-Net 


This section describes the Modified 3D U-Net used in each training round, which is the
main building block of the training process.

Like the standard U-Net and 3D U-Net, our network has an encoder and a decoder
that are interconnected by skip connections. The input data is center cropped to 155
x 160 x 160. In the encoding part, each layer contains a block of two 3 x 3 x 3
convolutions and Group Normalization (GN) [14] with a group size of 8. The layers are
followed by a Randomized Leaky Rectified Linear Unit (RReLU) [15] activation, whose
negative slope is randomly sampled from a uniform distribution. After the block, a 2 x 2 x 2
max-pooling with a stride of 2 is applied. In the decoding part, upsampling is performed
with trilinear interpolation. Skip connections from layers of equal resolution provide
information from the low-level features. A sigmoid nonlinearity is applied to the output,
which is then binarized with a threshold of 0.5.

The training objective is the dice loss function, which is discussed in more detail in
Sect. 3.4. The loss operates on the three brain tumor sub-regions WT, TC, and ET separately.
Our network uses the Adam [16] optimizer with a learning rate of 1e-4. The
learning rate is decayed by a factor of 1e-2 when the metric has stopped improving
for 2 epochs. The model is trained in 3 rounds for the 3 sub-regions, with 50 epochs
per round. The training procedure is described in Sect. 3.6 (Fig. 4).
Fig. 4. The proposed U-Net-based architecture with minor modifications, which is used in each
training round.
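
The learning-rate rule stated above (start at 1e-4, multiply by 1e-2 once the monitored metric has not improved for 2 epochs) can be sketched in plain Python. The function name and the exact way stale epochs are counted are assumptions; the scheduler actually used by the authors may count patience slightly differently.

```python
def plateau_schedule(metrics, lr=1e-4, factor=1e-2, patience=2):
    """Return the learning rate used at each epoch for a lower-is-better metric."""
    best = float("inf")
    stale = 0                      # epochs since the last improvement
    history = []
    for value in metrics:
        history.append(lr)         # rate in effect during this epoch
        if value < best:
            best, stale = value, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` epochs
                lr *= factor
                stale = 0
    return history

# Hypothetical validation losses: the plateau at epochs 2-3 triggers one decay.
losses = [0.50, 0.40, 0.41, 0.42, 0.39, 0.39]
rates = plateau_schedule(losses)
```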


3.3 Region-Based Training 


Although the provided labels for the data are ‘necrotic tumor core’, ‘edema’, and
‘enhancing tumor’, the evaluation of the segmentation is performed on three overlapping
brain tumor sub-regions: enhancing tumor (ET, label 4), tumor core (TC, labels 1
and 4), and whole tumor (WT, labels 1, 2 and 4). We found that the network’s performance
improved when segmenting these tumor sub-regions directly. Therefore, we change the
optimization target to the brain tumor sub-regions and apply a sigmoid
function to the output of the network.
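
The label-to-region conversion described above can be written in a few lines of NumPy. This is a sketch using the label values given in the text (1, 2, 4); the function name is hypothetical.

```python
import numpy as np

def labels_to_regions(seg):
    """Map a label volume with values 0, 1, 2, 4 to stacked binary WT, TC, ET masks."""
    wt = np.isin(seg, (1, 2, 4))   # whole tumor: labels 1, 2 and 4
    tc = np.isin(seg, (1, 4))      # tumor core: labels 1 and 4
    et = seg == 4                  # enhancing tumor: label 4
    return np.stack([wt, tc, et]).astype(np.uint8)

seg = np.array([[0, 1], [2, 4]])   # toy 2 x 2 label map
regions = labels_to_regions(seg)
```

Since the three masks overlap (ET lies inside TC, which lies inside WT), each output channel is an independent binary problem, which is why a per-channel sigmoid is a natural fit.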


3.4 Loss Function 


Dice loss originates from the Sørensen-Dice coefficient (DSC), which measures the
overlap between two sets. The coefficient ranges from 0 to 1, where 0 means
no overlap and 1 means perfect overlap. The dice loss is defined as 1 - DSC; it therefore
also ranges from 0 to 1, and smaller is better.


In this study, the dice loss is used as the loss function during training and validation.

$$L_{dice} = 1 - \frac{2 \sum_{i}^{N} p_i g_i + \epsilon}{\sum_{i}^{N} p_i + \sum_{i}^{N} g_i + \epsilon} \quad (1)$$

In the above formula, p_i and g_i represent the corresponding voxel values of the prediction
and the ground truth, respectively, and N is the number of voxels. ε = 1e-6 is a tiny number
added to both the numerator and the denominator to avoid division by zero.
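
As a worked illustration, the dice loss can be sketched in NumPy. The sketch assumes the common form DSC = (2 Σ p_i g_i + ε) / (Σ p_i + Σ g_i + ε) with ε = 1e-6; whether the authors' denominator uses plain or squared sums is not recoverable from the text, so the plain sums here are an assumption.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    """1 - DSC for one sub-region; pred holds probabilities or binary values."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    intersection = np.sum(pred * target)
    # eps appears in both numerator and denominator, so two empty masks give DSC = 1.
    dsc = (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
    return 1.0 - dsc

perfect = dice_loss([1, 1, 0], [1, 1, 0])    # identical masks
disjoint = dice_loss([1, 0], [0, 1])         # no overlap
both_empty = dice_loss([0, 0], [0, 0])       # nothing to segment, nothing predicted
```

With ε placed this way, an empty prediction for an empty ground truth is not penalised, while a fully disjoint prediction yields a loss close to 1.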


3.5 Metrics 


Several metrics are used in the evaluation of this semantic segmentation network:
DSC, sensitivity, specificity, and Hausdorff95 distance. DSC is a statistical
metric that scores the overlap between the segmentation result and the ground
truth. Sensitivity is the proportion of genuinely positive voxels in the 3D image that are
correctly classified as positive. Meanwhile, specificity is the proportion of truly
negative voxels that are correctly classified as negative. Lastly, the Hausdorff95 distance
is the 95th percentile of the distances from the boundary points in set X to the nearest
point in set Y. Using the 95th percentile instead of the maximum value reduces the impact
of a small subset of outliers.

With TP indicating true positive, TN true negative, FP false positive, FN false 
negative, the mathematical calculations of these metrics are presented as follows. 


$$DSC = \frac{2TP}{2TP + FP + FN} \quad (2)$$

$$Sensitivity = \frac{TP}{TP + FN} \quad (3)$$

$$Specificity = \frac{TN}{TN + FP} \quad (4)$$

$$Hausdorff95\ distance = P_{95}\left[\sup_{x \in X} d(x, Y),\ \sup_{y \in Y} d(X, y)\right] \quad (5)$$
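
The three overlap-based metrics follow directly from the confusion counts. The sketch below is illustrative only: the function names are hypothetical, the Hausdorff95 distance is omitted, and no guard is included for empty denominators (for example, a case with no positive voxels at all).

```python
import numpy as np

def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(pred & truth))
    tn = int(np.sum(~pred & ~truth))
    fp = int(np.sum(pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, tn, fp, fn

def overlap_metrics(pred, truth):
    """DSC, sensitivity and specificity as defined in Eqs. (2)-(4)."""
    tp, tn, fp, fn = confusion_counts(pred, truth)
    return {
        "dsc": 2 * tp / (2 * tp + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy 8-voxel example: TP = 2, FP = 1, FN = 1, TN = 4.
pred = np.array([1, 1, 0, 0, 1, 0, 0, 0])
truth = np.array([1, 0, 0, 0, 1, 1, 0, 0])
scores = overlap_metrics(pred, truth)
```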


3.6 Training Procedure 


The training procedure includes three training rounds. Each round is a normal
training process whose input is the volumetric data and whose output is the segmentation
of one tumor sub-region. Each later round uses the results of the previous rounds,
concatenated with the same volumetric data, as its input. The first round trains WT
segmentation, which covers the largest region, the second round is for TC, and the last
round is dedicated to ET, which is the hardest sub-region to segment. This training
procedure gives the model more information for segmenting the smaller sub-regions and
helps the later training rounds converge faster.


3.7 Postprocessing 


The output of the model has dimensions 3 x 155 x 160 x 160, with the three tumor
sub-region segmentations stacked together. The label values of the output are 0
(background) and 1 (tumor). However, the segmentation to be evaluated must have
dimensions 240 x 240 x 155 with labels 0, 1, 2, 4 as described in Sect. 2. Therefore, we
add zero padding around the height and width of the model’s output and change
the order of the dimensions so that the final output has dimensions 240 x 240 x 155.

We also need to convert the binary values of the three overlapping brain tumor sub-regions
into 0, 1, 2, 4 correspondingly. The output consists of 3 segmentations of brain tumor
sub-regions, each containing values 0 or 1. If value 1 appears in all three
tumor sub-regions, it is mapped to value 4, because the ET only occurs inside the TC,
which in turn only occurs inside the WT. If value 1 appears in 2 sub-regions, it is mapped
as follows (Table 1).
Table 1. The conversion table for mapping binary values of voxels


        WT    TC    ET
WT            1     4
TC
ET      4     4


Tumor voxels (voxels with value 1) that appear in both WT and ET, or in both TC and ET, are
converted to 4, while those that appear only in ET are changed to 0 (background). This
removes extra clusters that do not belong to the brain tumor region and reduces the number
of false positive classifications for ET. In addition, tumor voxels that appear only in TC
are converted to 1, as are those that appear in both WT and TC. This avoids the effect of
the WT segmentation “eating” the smaller brain tumor sub-regions (Fig. 5).


[Figure: three binary blocks of 155 x 160 x 160 (stacked as 3 x 155 x 160 x 160, labels [0, 1]) -> Zero Padding (3 x 155 x 240 x 240) -> Move Axes (3 x 240 x 240 x 155) -> Convert values (240 x 240 x 155, labels [0, 1, 2, 4])]


Fig. 5. Diagram of our post-processing step that converts a stack of three blocks of binary 
segmentation to one block of non-binary segmentation 
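

The post-processing steps above (zero padding, axis reordering, label conversion) can be sketched in NumPy. Several details are assumptions rather than statements from the text: the padding is taken to be symmetric (mirroring a center crop), voxels marked only in WT are mapped to label 2 (edema) because WT is the only region that contains label 2, and the function name is hypothetical.

```python
import numpy as np

def postprocess(stacked, out_hw=(240, 240)):
    """Convert stacked binary WT/TC/ET masks (3, D, H, W) to one label map (H', W', D)."""
    _, _, h, w = stacked.shape
    pad_h, pad_w = out_hw[0] - h, out_hw[1] - w
    # Symmetric zero padding of height and width (assumed to undo a center crop).
    padded = np.pad(
        stacked,
        ((0, 0), (0, 0),
         (pad_h // 2, pad_h - pad_h // 2),
         (pad_w // 2, pad_w - pad_w // 2)),
    )
    moved = np.moveaxis(padded, 1, -1).astype(bool)   # (3, H', W', D)
    wt, tc, et = moved
    labels = np.zeros(wt.shape, dtype=np.uint8)
    labels[wt] = 2                  # WT only -> 2 (assumed: edema)
    labels[tc] = 1                  # TC alone, or WT and TC -> 1
    labels[et & (wt | tc)] = 4      # ET together with WT or TC -> 4
    # Voxels marked only in ET are left as 0 (background).
    return labels

full_size = postprocess(np.zeros((3, 155, 160, 160), dtype=np.uint8))
```

The assignments are ordered from the weakest rule to the strongest, so later lines overwrite earlier ones where sub-region masks overlap.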


4 Experiments and Results 


The network is trained using the PyTorch framework on an Nvidia GeForce GTX 1080Ti
GPU. The GPU memory allows training with a batch size of 1. All mpMRI scans in the
training dataset are used in the proposed network. The BraTS 2021 training dataset is
divided into a training set and a validation set. The segmentation files are then
evaluated by the online platform. According to the reported results, our proposed network
achieves dice score and Hausdorff distance (95%) (HD95) as follows (Table 2).

Table 2. Mean Dice Score (DSC), Hausdorff distance (95%) (HD95), sensitivity, and specificity
of our proposed segmentation network on the BraTS 2021 validation dataset using the online
evaluation portal.


              WT      TC      ET      Mean
DSC           88.84   81.97   75.02   81.94
HD95          7.97    12.60   30.41   16.99
Sensitivity   90.32   83.40   76.94   83.55
Specificity   99.91   99.96   99.97   99.95


Overall, performance on WT segmentation is the highest among the brain tumor sub-regions,
as indicated by the highest DSC and lowest HD95. This agrees with the fact that
the WT sub-region is larger and has a smoother boundary than the other sub-regions.
Segmentation of ET has the lowest DSC and a very large HD95 because there
are cases with ET voxels in the ground truth for which our proposed network
fails to predict any. In such a case, the DSC drops to 0 and the HD95 becomes
maximal. In the other cases, where ET is predicted, the DSC for ET is high. Specifically,
the median DSC for ET is 86.89, which means that 50% of the validation cases have an ET
DSC higher than or equal to 86.89. The same happens for TC, but less often.
Therefore, the median DSC for TC is considerably higher (91.35) and only slightly
lower than the median for WT (92.50).


5 Discussion 


This manuscript describes our method for the BraTS 2021 challenge segmentation task.
We proposed a cascaded training pipeline for 3D brain tumor segmentation.
Our method trains a separate neural network to segment each brain tumor
sub-region and uses the results of the previous training round as input for the next
one. Training starts with WT, continues with TC, and ends with ET.
We use a 3D U-Net-inspired architecture with modifications to
train each of the sub-regions. Our results on the validation set are dice scores of
88.84, 81.97, and 75.02 for WT, TC, and ET, respectively. Because of the timeframe of
the challenge, our manuscript only covers a small number of modifications and therefore
has limitations.


Data Augmentation Has Not Been Applied. Data augmentation includes techniques that
increase the amount of training data by giving the model modified copies of the
existing samples. These techniques help the model generalize and reduce the effect of
class imbalance. In this study, we have not applied data augmentation to the
training data, which may hurt the model’s performance, especially on cases where ET does
not appear.


The Algorithm to Map Labels from 0 and 1 to 0, 1, 2, and 4 Needs to Be Implemented
Carefully. The output of our proposed model contains three layers of segmentation,
one for each brain tumor sub-region, and we need to convert them into one layer. The
segmentations of the individual sub-regions can disagree with each other. The algorithm
that converts those disagreements into a proper classification strongly affects the final
segmentation results. Further analysis should be done to improve the current mapping
algorithm.


Different Architectures for Each Training Round Have Not Been Explored.
Since we train the model to segment WT, TC and ET separately, there is room for
experiments with different architectures for each training round. For example, a deeper
architecture could be considered to learn more detailed information, as ET has more
irregular shapes than the others.


Larger Batch Sizes Have Not Been Explored. Due to our limited training
resources, we ran the training pipeline only with a batch size of 1. This limits our
ability to observe and analyze the benefit of Group
Normalization. We believe that more thorough optimization of hyperparameters could
result in faster convergence as well as further performance gains.


Abstract. Brain tumor segmentation in multi-modal MRI scans is a long-standing
and challenging task. Motivated by the winning solution in BraTS 2020 [7], we
incorporate region-based training, more aggressive data augmentation, and loss
ensembles into the widely used nnUNet model. Specifically, we train ten
cross-validation models based on two compound loss functions and select the
five best models for the ensemble. On the final testing set, our method achieves
average Dice scores of 0.8760, 0.8843, and 0.9300 and 95% Hausdorff Distance
values of 12.3, 15.3, and 4.75 for enhancing tumor, tumor core, and whole tumor,
respectively.


Keywords: nnUNet - Segmentation - Brain tumor - Loss function 


1 Introduction 


Brain tumor segmentation (BraTS) is a long-running and well-known challenge in the
medical image processing community, which aims to evaluate and compare different
state-of-the-art brain tumor segmentation methods in multi-parametric magnetic resonance
imaging (mpMRI) scans. In the recent BraTS 2021 challenge, participants were asked
to produce segmentation labels of three different glioma sub-regions: the enhancing
tumor (ET), the tumor core (TC), and the whole tumor (WT) from pre-operative
four-sequence MRI scans (T1-weighted, post-contrast T1-weighted, T2-weighted, and T2
Fluid Attenuated Inversion Recovery), which were acquired with different clinical
protocols and various scanners from multiple institutions.

nnUNet [6] has been widely used in various medical image segmentation challenges.
For example, nine out of ten top solutions in MICCAI 2020 were based on nnUNet [11].
The winning solution [7] in BraTS 2020 also employed nnUNet, with some BraTS-specific
modifications developed to further improve the performance, including region-based
training, postprocessing, increasing the batch size, using
more data augmentation, replacing the default instance normalization with batch
normalization, using batch Dice loss rather than sample Dice loss, and model selection with
BraTS-like ranking. The final model is an ensemble of the three top-performing models,
comprising 25 cross-validation models in total.


© The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 

Our solution is based on the BraTS 2020 winning solution with the publicly available
nnUNet trainer nnUNetTrainerV2BraTSRegions, which includes the region-based training
and more data augmentation. The main modification is that we train two groups of
models with different loss functions. More ablation studies with other settings
(e.g., batch Dice loss) would be desirable, but we do not have enough computational
resources to run these experiments. Due to the storage limitation (the data is located
on an HDD rather than an SSD), training one cross-validation model takes about 15 days
on an NVIDIA TITAN V100 GPU.


2 Methods 


The loss function is an important component of modern deep learning networks. A recent
study has shown that none of the popular segmentation loss functions consistently
achieves the best performance on multiple segmentation tasks and that compound
loss functions are the most robust and competitive losses [12]. Moreover, it has been
shown that directly optimizing the partially overlapping brain tumor regions (whole
tumor, tumor core, and enhancing tumor) can benefit segmentation performance in the
BraTS challenge [7, 8, 15, 19, 20]. Motivated by these successful solutions, we train two
groups of nnUNet models (nnUNetTrainerV2BraTSRegions), one with DiceCE loss and one with
DiceTopK loss, and select the best models for the final ensemble. The loss ensemble
strategy was also employed by the winning method of the intracranial aneurysm
segmentation challenge [10, 18], which is a highly imbalanced segmentation task as well.
The loss function details are provided as follows. Let g_i and s_i denote the ground
truth and the predicted segmentation, respectively. N and C denote the number of voxels
and categories in the ground truth mask, respectively.


2.1 Cross Entropy Loss 


$$L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} g_i^c \log s_i^c, \quad (1)$$

where g_i^c is a binary indicator of whether class label c is the correct classification
for pixel i, and s_i^c is the corresponding predicted probability.


2.2 Dice Loss 


Dice similarity coefficient (DSC) is the most commonly used segmentation evaluation
metric, and Dice loss [14] is designed to directly optimize the DSC. It is defined by

$$L_{Dice} = 1 - \frac{2 \sum_{i=1}^{N} \sum_{c=1}^{C} g_i^c s_i^c}{\sum_{i=1}^{N} \sum_{c=1}^{C} (g_i^c)^2 + \sum_{i=1}^{N} \sum_{c=1}^{C} (s_i^c)^2}. \quad (2)$$

Unlike weighted cross entropy, it does not require class re-weighting for imbalanced
segmentation tasks.


2.3 TopK Loss


TopK loss aims to force networks to focus on hard samples during training. It retains only
the k% worst pixels for the loss, irrespective of their loss/probability values, and is
defined by

$$L_{TopK} = -\frac{1}{N} \sum_{c=1}^{C} \sum_{i \in K} g_i^c \log s_i^c, \quad (3)$$

where K is the set of the k% worst pixels. We choose k = 10% in our experiments
because other percentage settings do not show remarkable improvements [12].
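
A minimal NumPy sketch of the TopK idea follows: compute the cross entropy per voxel, keep the k% largest values, and average them. Averaging over the retained voxels (rather than dividing by N as the formula is written) is an assumption, as are the function name and the (N, C) array layout.

```python
import numpy as np

def topk_cross_entropy(probs, onehot, k=10.0, eps=1e-12):
    """probs, onehot: arrays of shape (N, C); k is the percentage of voxels kept."""
    per_voxel = -np.sum(onehot * np.log(probs + eps), axis=1)   # cross entropy per voxel
    n_keep = max(1, int(round(per_voxel.size * k / 100.0)))
    worst = np.sort(per_voxel)[-n_keep:]                        # the k% largest losses
    return float(worst.mean())

# Ten voxels, two classes: nine are predicted well, one is badly wrong.
probs = np.array([[0.9, 0.1]] * 9 + [[0.2, 0.8]])
onehot = np.array([[1.0, 0.0]] * 10)
hard = topk_cross_entropy(probs, onehot, k=10.0)    # only the worst voxel is kept
full = topk_cross_entropy(probs, onehot, k=100.0)   # equals the ordinary mean cross entropy
```

In this toy case the single hard voxel dominates the TopK value, whereas it is diluted by the nine easy voxels in the ordinary mean.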


2.4 Compound Loss 


Compound loss functions have been shown to be relatively robust [12]. Thus, we use
two compound loss functions to train the nnUNet models:


(1) DiceCE loss: Dice loss plus cross entropy

$$L_{DiceCE} = L_{Dice} + L_{CE}; \quad (4)$$

(2) DiceTopK loss: Dice loss plus TopK loss

$$L_{DiceTopK} = L_{Dice} + L_{TopK}. \quad (5)$$
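
The two compound losses can be sketched together in NumPy, assuming a Dice loss with squared terms in the denominator and a TopK term averaged over the retained voxels. All function names are hypothetical, and this is not the nnUNet implementation (which, among other differences, operates on batched tensors with deep supervision).

```python
import numpy as np

def voxel_ce(probs, onehot, eps=1e-12):
    """Cross entropy per voxel for (N, C) probabilities and one-hot targets."""
    return -np.sum(onehot * np.log(probs + eps), axis=1)

def dice_loss(probs, onehot, eps=1e-12):
    """1 - 2*sum(g*s) / (sum(g^2) + sum(s^2)), summed over voxels and classes."""
    numerator = 2.0 * np.sum(probs * onehot)
    denominator = np.sum(onehot ** 2) + np.sum(probs ** 2) + eps
    return float(1.0 - numerator / denominator)

def dice_ce(probs, onehot):
    """DiceCE: Dice loss plus mean cross entropy."""
    return dice_loss(probs, onehot) + float(voxel_ce(probs, onehot).mean())

def dice_topk(probs, onehot, k=10.0):
    """DiceTopK: Dice loss plus cross entropy averaged over the k% worst voxels."""
    ce = np.sort(voxel_ce(probs, onehot))
    n_keep = max(1, int(round(ce.size * k / 100.0)))
    return dice_loss(probs, onehot) + float(ce[-n_keep:].mean())

probs = np.array([[0.9, 0.1]] * 9 + [[0.2, 0.8]])
onehot = np.array([[1.0, 0.0]] * 10)
loss_ce = dice_ce(probs, onehot)
loss_topk = dice_topk(probs, onehot)
```

Both losses share the Dice term, so they differ only in how strongly the hardest voxels are weighted by the second term.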


Fig. 1. An overview of the network architecture.


2.5 nnUNet Network Architecture


We employ the default nnUNet [6] as the main network architecture because it has
shown very strong performance on many segmentation tasks [11]. Figure 1 presents an
overview of the network architecture. Specifically, it contains an encoder and a decoder
that are composed of plain 3 x 3 x 3 convolutions, 1 x 1 x 1 convolutions, 2 x 2 x 2
convolutions and transposed convolutions, instance normalization, Leaky ReLU, and skip
connections. The number of channels is displayed in the encoder part of the network.
Deep supervision [5] (green box) is added to all but the two lowest resolutions in the
decoder.

Motivated by the BraTS 2020 winning solution [7], the main modification to the 
default nnUNet is to replace the softmax function with a sigmoid function and change 
the optimization target to the three tumor subregions (whole tumor, tumor core, and 
enhancing tumor). 


3 Experimental Results 


3.1 Environment Setting 


All the experiments are based on the Multimodal Brain Tumor Segmentation Challenge
2021 (BraTS 2021) dataset [1-3, 13, 16, 17], which includes 1251 training cases and 219
validation cases. We train five cross-validation models each with DiceCE loss and
DiceTopK loss. The final model is the ensemble of the five best-fold models. Table 1
lists our experimental environments and requirements and Table 2 presents the training
protocols. The training time is very long (about 15 days per model) because we only
have a traditional mechanical hard disk to store the dataset and the CPU is a bottleneck
as well. The data loading process is very slow in this setting. Using a solid state disk
and a more powerful CPU would significantly reduce the training time.


Table 1. Environments and requirements.


Windows/Ubuntu version  | CentOS 3.10.0
CPU                     | Intel E5-2650 v4 Broadwell @ 2.2 GHz
GPU                     | NVIDIA P100 12G
CUDA version            | 11.0
Programming language    | Python 3.8
Deep learning framework | PyTorch (Torch 1.8.0, torchvision 0.2.2)
Specific dependency     | nnUNet 0.6


Table 2. Training protocols.


Data augmentation methods      | Rotations, scaling, Gaussian noise, Gaussian blur, brightness, contrast, simulation of low resolution, gamma correction and mirroring
Initialization of the network  | “he” normal initialization
Patch sampling strategy        | More than a third of the samples in a batch contain at least one randomly chosen foreground class, which is the same as nnUNet [6].
Batch size                     | 2
Patch size                     | 128 x 128 x 128
Total epochs                   | 1000
Optimizer                      | Stochastic gradient descent with Nesterov momentum (μ = 0.99)
Initial learning rate          | 0.01
Learning rate decay schedule   | Poly learning rate policy: (1 - epoch/1000)^0.9
Stopping criteria, and optimal model selection criteria | Stopping criterion is reaching the maximum number of epochs (1000).
Training time                  | ~15 days/model
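

The poly learning rate policy in Table 2 can be written as a one-line function. The sketch assumes an exponent of 0.9 (the value used by the default nnUNet schedule) together with the initial learning rate of 0.01 and the 1000 epochs listed in the table; the function name is hypothetical.

```python
def poly_lr(epoch, initial_lr=0.01, max_epochs=1000, exponent=0.9):
    """Polynomial decay: initial_lr * (1 - epoch / max_epochs) ** exponent."""
    return initial_lr * (1.0 - epoch / max_epochs) ** exponent

# Learning rate at the start, halfway, near the end, and at the final epoch.
schedule = [poly_lr(e) for e in (0, 500, 999, 1000)]
```

With an exponent below 1, the rate stays close to its initial value for most of training and drops steeply only in the last epochs.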


3.2 Quantitative Results on Training Set 


We trained five-fold models for DiceCE loss and DiceTopK loss, respectively. Table 3
presents the quantitative results (Dice) for each tumor component. DiceCE loss obtained
better performance on folds 2-5 and DiceTopK loss obtained better performance on fold
1. Finally, we selected the model with better performance in each fold to form the
ensemble that was used for predicting the validation set and testing set.


Table 3. Dice scores of five-fold cross validation results on training set. In each fold, best scores
are highlighted with bold numbers.


Fold | Loss function | Enhancing tumor | Tumor core | Whole tumor | Average DSC
1    | DiceCE        | 0.8647          | 0.9134     | 0.9242      | 0.9008
1    | DiceTopK      | 0.8701          | 0.9213     | 0.9327      | 0.9080
2    | DiceCE        | 0.8781          | 0.9225     | 0.9341      | 0.9116
2    | DiceTopK      | 0.8614          | 0.9162     | 0.9211      | 0.8996
3    | DiceCE        | 0.8806          | 0.9299     | 0.9416      | 0.9149
3    | DiceTopK      | 0.8624          | 0.9014     | 0.9243      | 0.8960
4    | DiceCE        | 0.8812          | 0.9243     | 0.9416      | 0.9157
4    | DiceTopK      | 0.8753          | 0.9174     | 0.9361      | 0.9096
5    | DiceCE        | 0.8723          | 0.9145     | 0.9289      | 0.9052
5    | DiceTopK      | 0.8613          | 0.9042     | 0.9217      | 0.8957
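

The per-fold model selection can be reproduced from the average DSC column of Table 3. Using the average DSC as the selection criterion is an assumption; the text only says that the model with better performance in each fold was kept.

```python
# Average DSC per fold and loss, copied from Table 3.
average_dsc = {
    1: {"DiceCE": 0.9008, "DiceTopK": 0.9080},
    2: {"DiceCE": 0.9116, "DiceTopK": 0.8996},
    3: {"DiceCE": 0.9149, "DiceTopK": 0.8960},
    4: {"DiceCE": 0.9157, "DiceTopK": 0.9096},
    5: {"DiceCE": 0.9052, "DiceTopK": 0.8957},
}

# Keep, for every fold, the loss whose model scored the higher average DSC.
selected = {fold: max(scores, key=scores.get) for fold, scores in average_dsc.items()}
```

The result matches the statement in the text: DiceTopK is kept for fold 1 and DiceCE for folds 2-5, giving a five-model ensemble.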


3.3 Quantitative Results on Validation Set 


Table 4 shows the quantitative results on the validation dataset. The whole tumor achieved
the best performance, while the performance on the enhancing tumor was relatively low. We
also checked the per-case performance and found that several cases obtained Dice
scores of 0 because the ground truth does not contain enhancing tumor but the
segmentation result does.


Table 4. Dice and 95% Hausdorff Distance of brain tumor segmentation on validation set.


             | Dice                                        | 95% Hausdorff distance
Metric       | Enhancing tumor | Tumor core | Whole tumor | Enhancing tumor | Tumor core | Whole tumor
Mean         | 0.8217          | 0.8786     | 0.9259      | 21.09           | 9.20       | 3.80
Std          | 0.2446          | 0.1835     | 0.0803      | 81.04           | 43.66      | 6.28
Median       | 0.9032          | 0.9421     | 0.9481      | 1.41            | 1.73       | 2.24
25% quantile | 0.8249          | 0.8727     | 0.9063      | 1.00            | 1.00       | 1.41
75% quantile | 0.9530          | 0.9686     | 0.9691      | 2.45            | 3.74       | 3.74
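

The gap between the mean and the median Dice for the enhancing tumor can be illustrated with a small sketch of the zero-Dice edge case. The per-case scores below are hypothetical, and returning 1.0 when both masks are empty is an assumed convention rather than something stated in the text.

```python
import numpy as np

def dice(pred, truth):
    """Dice score of two binary masks; both empty is scored 1.0 (assumed convention)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denominator = pred.sum() + truth.sum()
    if denominator == 0:
        return 1.0
    return float(2.0 * np.sum(pred & truth) / denominator)

# Ground truth without enhancing tumor, prediction with three false-positive voxels.
truth_et = np.zeros(1000, dtype=bool)
pred_et = np.zeros(1000, dtype=bool)
pred_et[:3] = True
false_alarm_score = dice(pred_et, truth_et)

# Hypothetical per-case scores: one such failure lowers the mean but not the median.
case_scores = np.array([0.90, 0.92, 0.88, false_alarm_score])
mean_score, median_score = case_scores.mean(), np.median(case_scores)
```

A handful of false-positive voxels is enough to turn an otherwise empty case into a score of 0, which is why the standard deviation for the enhancing tumor is large.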


We also present the sensitivity and specificity results. Our solution has nearly perfect
specificity, indicating that most segmented voxels are real brain tumor. The sensitivity
is lower than the specificity, indicating that some lesions are missed in the segmentation
results (Table 5).


Table 5. Sensitivity and specificity of brain tumor segmentation on validation set.


             | Sensitivity                                 | Specificity
Metric       | Enhancing tumor | Tumor core | Whole tumor | Enhancing tumor | Tumor core | Whole tumor
Mean         | 0.8237          | 0.8621     | 0.9304      | 0.9998          | 0.9998     | 0.9994
Std          | 0.2575          | 0.1978     | 0.0894      | 0.0004          | 0.0003     | 0.0008
Median       | 0.9156          | 0.9383     | 0.9578      | 0.9999          | 0.9999     | 0.9996
25% quantile | 0.8350          | 0.8549     | 0.9165      | 0.9998          | 0.9998     | 0.9992
75% quantile | 0.9659          | 0.9754     | 0.9826      | 1.0000          | 1.0000     | 0.9998


3.4 Qualitative Results on Validation Set 


Figure 2 and Fig. 3 show some visualized examples of well-segmented cases and poorly
segmented cases, respectively. Most well-segmented cases in Fig. 2 have good contrast
and clear tumor boundaries. However, the tumors in poorly segmented cases (as
shown in Fig. 3) usually have low contrast, especially for the enhancing tumor.
Moreover, some cases have a different intensity distribution (e.g., the last row in Fig. 3)
from the training set, and the trained models cannot generalize well to such cases.


3.5 Quantitative Results on Testing Set 


Table 6 presents the final results on the testing set. Overall, the performance is
comparable to or even better than the performance on the validation set (Table 4),
indicating that our method has good generalization ability.


Table 6. Dice and 95% Hausdorff Distance of brain tumor segmentation on testing set.


             | Dice                                        | 95% Hausdorff distance
Metric       | Enhancing tumor | Tumor core | Whole tumor | Enhancing tumor | Tumor core | Whole tumor
Mean         | 0.8760          | 0.8843     | 0.9300      | 12.3            | 15.3       | 4.75
Std          | 0.1853          | 0.2293     | 0.0903      | 59.6            | 65.1       | 17.0
Median       | 0.9382          | 0.9636     | 0.9582      | 1.00            | 1.41       | 1.73
25% quantile | 0.8549          | 0.9159     | 0.9166      | 1.00            | 1.00       | 1.00
75% quantile | 0.9679          | 0.9827     | 0.9782      | 2.00            | 3.00       | 4.12


(a) T1ce Image (b) Segmentation (c) T2 Image (d) Segmentation


Fig. 2. Examples of well-segmented cases. The green, yellow, and red colors denote edema,
enhancing tumor and tumor core, respectively. (Color figure online)


(a) T1ce Image (b) Segmentation (c) T2 Image (d) Segmentation


Fig. 3. Examples of poor segmentation results. The green, yellow, and red colors denote edema,
enhancing tumor and tumor core, respectively. (Color figure online)


4 Discussion and Conclusion 


In this work, we have trained the popular nnUNet with region-based training and loss
ensembles to segment brain tumors. Experiments on the testing set show that our
solution achieved Dice scores of 0.8760, 0.8843, and 0.9300 and 95% Hausdorff Distance
values of 12.3, 15.3, and 4.75 for enhancing tumor, tumor core, and whole tumor,
respectively. The performance could be further improved by using more powerful training
infrastructure and including more data augmentation [4, 9] to reduce domain gaps.
Moreover, our final model is an ensemble of five models. It would be interesting to
compare different methods using only one model each, because model ensembles usually
require extensive computational resources, which could hinder deployment in real clinical
practice.


Abstract. This paper proposes a Deeply Supervised Attention U-Net Deep Learning
network with a novel image mining augmentation method to segment brain
tumors in MR images. The network was trained on the 3D segmentation task of
the BraTS2021 Challenge Task 1. The Attention U-Net model improves upon the
original U-Net by increasing focus on relevant feature maps, increasing training
efficiency and increasing model performance. Notably, a novel data augmentation
technique termed Positive Mining was applied. This technique crops out randomly
scaled, positively labelled training samples and adds them to the training pipeline.
This can effectively increase the discriminative ability of the network to identify
a tumor and use tumor feature-specific attention maps. The metrics used to train
and validate the network were the Dice coefficient and the Hausdorff metric. The
best performance on the online final dataset with the aforementioned network and
augmentation technique was: Dice Scores of 0.858, 0.869 and 0.913 and Hausdorff
Distance of 12.7, 16.9 and 5.43 for the Enhancing Tumor (ET), Tumor Core
(TC) and Whole Tumor (WT), respectively.


Keywords: Attention U-Net - Brain tumor segmentation - Positive Mining 


1 Introduction 


1.1 Medical Image Segmentation 


Image segmentation has become integral to clinical practice,
as medical diagnoses are frequently accompanied by scanned images. These images can
be digitally stored and labelled by a medical professional in the relevant field to
highlight regions of interest for diagnosis. However, diagnosing and segmenting
the images requires a medical specialist and can be inefficient, time-consuming
and error-prone. Thus, the emergence and improvement of AI models that assist medical
professionals in performing such segmentations and in automatically labelling regions of
interest holds great potential for improving the service and quality of healthcare in the
modern age [1].

Specifically for medical segmentation tasks, there has been large growth in the
number and types of Deep Learning architectures that have shown state-of-the-art
performance on medical segmentation challenges. This has led to architectures that show
strong potential to outperform medical professionals and even to provide new
insights into understanding and managing disease [2].


A. Crimi and S. Bakas (Eds.): BrainLes 2021, LNCS 12962, pp. 431-440, 2022.
https://doi.org/10.1007/978-3-031-08999-2_37




1.2 Brain Tumor AI Challenge (2021) 


The task addressed in this paper is a brain tumor segmentation task. The
dataset consists of 3D multi-modal MR brain scans of brain cancer patients. This task
is hosted as the Brain Tumor Segmentation Challenge (BraTS) 2021. BraTS is a
long-running challenge that uses multi-parametric magnetic resonance imaging (mp-MRI)
scans. BraTS provides a very large and comprehensive annotated database of brain
mp-MRI scans with detailed labelled segmentations [3-5, 7, 10]. The dataset [3-5, 7, 10] for
the year 2021 comprises 4 modes of MR scans: a) native (T1), b) post-contrast
T1-weighted (T1Gd), c) T2-weighted (T2) and d) T2 Fluid Attenuated Inversion Recovery
(T2-FLAIR). These scans were collected with different protocols and scanners from
numerous institutions. Each training sample contains 4 modal MR images and 1 labelled
segmentation mask. The original training set contains 3 labels: Non Enhancing Tumor
(NET), peritumoral edematous tissue (ED), and Necrotic Tumor Core (NCR); the
background is labelled as 0. A sample of the training dataset is shown below:


Fig. 1. A sample from the BraTS 2021 training dataset visualizing the mp-MRI scans. Top:
T1, second from top: T2, third from top: T1Gd, bottom: T2-FLAIR


BraTS2021 provided 1251 training samples and 219 validation samples. The annotations
for the validation set are not released to the participants. The participants can only
receive a score on the validation set by making submissions to the official platform
chosen by BraTS. The scores on the platform for BraTS2021 were unranked (Fig. 1).

The winners of the most recent BraTS2020 have largely been Deep Learning networks,
with the U-Net architecture showing consistent state-of-the-art performance [6].
Additionally, previous winners use a modified labelling framework. The original labels
of NET, ED and NCR are transformed into 3 new labels. The ET label is unmodified,
then a new label named Whole Tumor (WT) is made by combining the label of ED with
TC. Finally, ET, NET and NCR are combined to create a label named Tumor Core (TC).


2 Methods 


2.1 Baseline Architecture 


The baseline architecture for the network was the 3-D Attention U-Net [9]. This
architecture is built by adding a Convolutional Block Attention Module (CBAM) [8] to the
residual skip connections of a U-Net architecture. This architecture is shown below:


[Figure: legend: Conv (3 x 3 x 3), Upsampling (2 x 2 x 2), CBAM Attention Gate; channel counts C = 4 at the input, C = 48, 96 and 192 in successive stages, C = 3 at the output]


Fig. 2. 3D Attention U-Net network with CBAM module


2.2 CBAM Attention Mechanism 


Specifically, the CBAM is a dual-attention module that uses a channel module and a spatial
module to create attention maps that increase the “focus” of the network on more relevant
and discriminative features across the channels and the spatial domain. The CBAM
achieves this by generating 2 attention maps [8]: a 1D channel attention map Mc (C
x 1 x 1) and a 2D spatial attention map Ms (1 x H x W). The Mc
map uses average pooling (AvgPool) and max pooling (MaxPool) to down-sample an
incoming input; each pooled result is then passed through a 1-layer Multi-Layer
Perceptron (MLP). The Hadamard product (⊗) is calculated between the 2 pooled branches.
Finally, the output of the product is activated by a sigmoid function. This process is
shown below:


$$\mathrm{Mc}(F) = \mathrm{Sigmoid}\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) \otimes \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$$


$$F' = \mathrm{Mc}(F) \otimes F$$


Then, the next Ms map operates on F’ to create space-specific attention. It applies 
the AvgPool and MaxPool along the Channel axis, then concatenates the output along 
the same axis. It is then run through a convolutional layer (Conv) and activated by a 
sigmoid function. 


$$\mathrm{Ms}(F') = \mathrm{Sigmoid}\big(\mathrm{Conv}([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')])\big)$$


$$F'' = \mathrm{Ms}(F') \otimes F'$$


Thus, the final attention-activated output F'' can be obtained from the CBAM. This
output helps the network focus on relevant and discriminative features, as supported by
the ablation studies in the following sections.
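
The two attention steps can be sketched in NumPy as follows. This is a minimal illustration
under several assumptions rather than the authors' implementation: the feature map is 3D
(C x H x W x L), the MLP is a single shared linear layer, the two pooled branches are combined
with a Hadamard product as the text describes (the original CBAM paper [8] sums them), and the
spatial convolution is reduced to a 1 x 1 x 1 kernel because the kernel size is not given.
All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f, w, b):
    """Mc(F): pool over space, shared 1-layer MLP, combine, sigmoid.

    f: (C, H, W, L) feature map; w: (C, C) and b: (C,) are the MLP weights.
    """
    avg = f.mean(axis=(1, 2, 3))            # AvgPool -> (C,)
    mx = f.max(axis=(1, 2, 3))              # MaxPool -> (C,)
    # Hadamard product of the two branches, as described in the text.
    mc = sigmoid((w @ avg + b) * (w @ mx + b))
    return mc[:, None, None, None]          # (C, 1, 1, 1)

def spatial_attention(f, k, bias=0.0):
    """Ms(F'): pool over channels, concatenate, convolve, sigmoid.

    k: (2,) weights of an assumed 1 x 1 x 1 convolution over the two pooled maps.
    """
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])      # (2, H, W, L)
    ms = sigmoid(np.tensordot(k, pooled, axes=1) + bias)    # (H, W, L)
    return ms[None]                         # (1, H, W, L)

def cbam(f, w, b, k):
    f1 = channel_attention(f, w, b) * f     # F'  = Mc(F)  * F
    return spatial_attention(f1, k) * f1    # F'' = Ms(F') * F'

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 8, 8, 8))
out = cbam(f, rng.normal(size=(4, 4)), np.zeros(4), np.array([0.5, 0.5]))
```

Because both maps pass through a sigmoid, every voxel of the output is a down-weighted copy of
the input, so the module can only suppress features, never amplify them.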


2.3 Architecture Parameters 


The input convolutional layer of the 3-D Attention U-Net was chosen to have 48
channels (C = 48). The encoder and decoder consist of 4 stages, with a skip connection
between corresponding levels. Starting from the left (encoder), each input block is passed
through a double convolution with kernel size 3 x 3 x 3 (H x W x L) and stride 1, and
activated with the Rectified Linear Unit (ReLU) function. This activated output is then
passed through the CBAM attention module and concatenated to the opposing decoder stage via
a skip connection. The non-attention output is down-sampled by max pooling of size
2 x 2 x 2 with stride 2. This encoding process repeats thrice, as shown in Fig. 2, with
the channel depth of each successive encoder stage doubling until it reaches the
third, bottom-most layer. At the bottom (the so-called bottleneck layer), after
the double convolution, 2 dilated convolutions are performed with a dilation rate of 2
and without max pooling. Deep supervision is performed on the convolutions following the
encoder and dilation layers by using a 1 x 1 x 1 convolutional layer with stride 1,
sigmoid activation and trilinear up-sampling [9].

In the decoder stage on the right, the feature map is up-sampled by 2 x 2 x 2 and
concatenated with the attention-activated skip connection from the encoder. Deep
supervision is performed on the feature map that is then up-sampled in the decoder layer.
This process is repeated thrice in the decoder, with the channel depth halving at each
up-sampling. Finally, a 3-label prediction is obtained at the final decoder output by
passing it through a 3-channel final convolution.
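
The channel and resolution schedule described above can be traced with a short helper. This
is a sketch that assumes three resolution levels as drawn in Fig. 2 and a 128-voxel crop as
used in Sect. 2.5; the helper name `encoder_shapes` is hypothetical.

```python
def encoder_shapes(in_size=128, base_channels=48, levels=3):
    """Return (channels, spatial size) per resolution level, top to bottom."""
    shapes = []
    channels, size = base_channels, in_size
    for level in range(levels):
        shapes.append((channels, size))
        if level < levels - 1:      # no pooling after the bottom level
            channels *= 2           # channel depth doubles per stage
            size //= 2              # 2 x 2 x 2 max pooling, stride 2
    return shapes

enc = encoder_shapes()              # encoder levels, top to bottom
dec = enc[-2::-1]                   # decoder mirrors the encoder, bottom to top
```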


2.4 Loss Function and Metrics


The loss function used for training is the Dice Loss [11]. Dice is also one of the 2
metrics BraTS 2021 uses to evaluate submissions; the other is the Hausdorff distance [11].
The Hausdorff distance was not included in the loss function, as it is very
computationally expensive and slows down training significantly.

The Dice Loss is based on the Dice Metric which is defined as: 


$$\text{Dice Metric} = \frac{2TP}{2TP + FN + FP}$$


where TP, FN and FP denote the True Positive, False Negative and False Positive
predictions. This metric is computed individually for each label class: ET, WT and TC.
The Dice Loss can then be defined from this metric as:


$$\text{Dice Loss} = 1 - \frac{1}{N} \sum_{n} \frac{P_n \times R_n + \epsilon}{P_n^2 + R_n^2 + \epsilon}$$


where $P_n$ is the activated prediction output by the network for a given input, $R_n$ is
the ground truth label, the subscript $n$ signifies the channel (MR modes), $N$ is the
number of channels, and $\epsilon$ is a smoothing factor that ensures a continuous function.
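
Both definitions translate directly into NumPy. The sketch below follows the formulas as
written in this section; the smoothing value `eps=1.0`, the handling of an empty prediction
and empty label, and the function names are assumptions.

```python
import numpy as np

def dice_metric(pred, target):
    """2TP / (2TP + FN + FP) for one binary label channel."""
    pred, target = np.asarray(pred, bool), np.asarray(target, bool)
    tp = np.sum(pred & target)
    fn = np.sum(~pred & target)
    fp = np.sum(pred & ~target)
    denom = 2 * tp + fn + fp
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2 * tp / denom

def dice_loss(p, r, eps=1.0):
    """1 - mean over channels of (sum(P*R) + eps) / (sum(P^2) + sum(R^2) + eps)."""
    p, r = np.asarray(p, float), np.asarray(r, float)
    axes = tuple(range(1, p.ndim))          # sum over voxels, keep the channel axis
    inter = np.sum(p * r, axis=axes)
    denom = np.sum(p ** 2, axis=axes) + np.sum(r ** 2, axis=axes)
    return 1.0 - np.mean((inter + eps) / (denom + eps))
```

Note that, as written (without a factor of 2 in the numerator), a perfect prediction gives a
loss close to 0.5 rather than 0; many implementations multiply the intersection by 2 so that
the minimum is 0.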

Numerous variations of the Dice Loss were tested; specifically, a weighted Dice
Loss was tried with different weightings. However, these had minimal or negative effects
on performance. A noted difficulty was finding a weighting that optimized
performance.


2.5 Image Processing and Augmentation 


The input image is first cropped to C x 128 x 128 x 128, where C is the number of input
channels, which is 4 for our network, corresponding to the 4 MR modalities. This size
offered the best trade-off between performance and information retention: larger crop
sizes considerably increased computational time, while smaller ones caused comparatively
significant information loss. The image intensity is then clipped to the 1st-99th
percentile range of the non-zero intensity values.
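
A minimal sketch of the two pre-processing steps is given below. It assumes a centre crop
(the text does not say how the crop window is placed) and leaves zero-valued background voxels
untouched when clipping; both function names are hypothetical.

```python
import numpy as np

def center_crop(volume, size=128):
    """Crop the three spatial axes of a (C, H, W, L) volume around its centre."""
    spatial = volume.shape[1:]
    if any(d < size for d in spatial):
        raise ValueError("volume is smaller than the crop size")
    slices = tuple(slice((d - size) // 2, (d - size) // 2 + size) for d in spatial)
    return volume[(slice(None),) + slices]

def clip_to_percentiles(channel, low=1.0, high=99.0):
    """Clip the non-zero voxels of one channel to their [low, high] percentiles."""
    out = channel.astype(float)             # astype returns a copy
    mask = out != 0
    if mask.any():
        lo, hi = np.percentile(out[mask], [low, high])
        out[mask] = np.clip(out[mask], lo, hi)
    return out
```

Computing the percentiles on non-zero voxels only keeps the large zero background of a
skull-stripped scan from dominating the intensity statistics.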

Probability based augmentations were then performed in the pre-processing pipeline 
as follows: 


- Dropout with a Probability of 0.2
- Rescaling by 1.1 or 0.9 with a Probability of 0.5
- Random flipping along chosen axis with a Probability of 0.3
- Positive Mining with a Probability of 0.1


This pre-processing was heavily inspired by [9], though the augmentations applied in 
this paper are less aggressive as stronger augmentations were seen to decrease accuracy, 
especially on lower epoch training pipelines. Positive Mining was seen to effectively 
improve performance. This augmentation is further explained in the next section. 
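
The probability-based scheme can be sketched as below, with each augmentation drawn
independently. The interpretations are assumptions borrowed from the style of pipeline in [9]:
"dropout" is taken to zero one input channel and "rescaling" to multiply intensities by 0.9 or
1.1. Positive mining is omitted here because it needs the label. The function name `augment`
is hypothetical.

```python
import numpy as np

def augment(image, rng):
    """Apply each augmentation independently with its stated probability.

    image: (C, H, W, L) array. Returns an augmented copy.
    """
    out = image.copy()
    if rng.random() < 0.2:                  # dropout (assumed: zero one channel)
        out[rng.integers(out.shape[0])] = 0
    if rng.random() < 0.5:                  # rescaling (assumed: intensity factor)
        out = out * rng.choice([0.9, 1.1])
    if rng.random() < 0.3:                  # random flip along one spatial axis
        out = np.flip(out, axis=int(rng.integers(1, out.ndim)))
    return out

img = np.ones((4, 6, 6, 6))
aug = augment(img, np.random.default_rng(0))
```

In a real pipeline a flip would have to be applied to the segmentation label as well, so the
image and label stay aligned.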


2.6 Positive Mining


Positive Mining is a novel method and is defined as a type of image augmentation that 
extracts only the non-zero (positively) labelled regions of the training segmentation mask 
from the training image. This is defined below: 


$$\text{Positively Mined Sample} = T_i \otimes E(T_L)$$


where $T_i$ is the input training sample, $T_L$ is the segmentation label, and $E$ is a
function that randomly resizes and interpolates the non-zero label by a scale factor between
0.9 and 1.1.

The actual image label is left unchanged. The E function’s scaling is done to cap- 
ture information about regions in and around the label, since these regions can provide 
potentially strong discriminative features for the network. This augmentation method 
corresponded to an increase in accuracy. This is hypothesized to be due to the network 
more effectively learning tumor isolated samples and the attention mechanisms being 
able to create new feature focus maps that centre on tumor spatial features (Fig. 3). 
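
A minimal NumPy sketch of this augmentation follows. It assumes that E rescales the binary
non-zero mask about its centroid with nearest-neighbour interpolation; the text does not
specify the interpolation or the centre of scaling, and the function names are hypothetical.

```python
import numpy as np

def rescale_mask(mask, scale):
    """Nearest-neighbour rescaling of a binary mask about its centroid."""
    coords = np.argwhere(mask)
    if coords.size == 0:
        return mask.copy()
    centre = coords.mean(axis=0)
    # Coordinates of every output voxel, in the same order as a flattened array.
    grid = np.indices(mask.shape).reshape(mask.ndim, -1).T
    # For each output voxel, find the source voxel it is sampled from.
    src = np.rint(centre + (grid - centre) / scale).astype(int)
    inside = np.all((src >= 0) & (src < np.array(mask.shape)), axis=1)
    out = np.zeros(mask.size, dtype=bool)
    out[inside] = mask[tuple(src[inside].T)]
    return out.reshape(mask.shape)

def positive_mining(image, label, rng):
    """T_i * E(T_L): keep only voxels inside the rescaled non-zero label."""
    scale = rng.uniform(0.9, 1.1)
    mined_mask = rescale_mask(label > 0, scale)
    return image * mined_mask               # mask broadcasts over the channel axis

mask = np.zeros((12, 12, 12), dtype=bool)
mask[4:8, 4:8, 4:8] = True
image = np.ones((4, 12, 12, 12))
mined = positive_mining(image, mask.astype(int) * 2, np.random.default_rng(0))
```

Only the image is masked; the label array passed in is not modified, matching the statement
that the actual image label is left unchanged.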

The positive mining augmentation is visualized in the figure below, where the normal 
sample of the brain is in the top layer and the positively mined samples are shown at the 
bottom. 


[Figure: top row, original sample (panel title: Image by Mode 0); bottom row, positively
mined samples (panel titles: Positively Mined by Mode 0, Positively Mined by Mode 2).]

Fig. 3. Positive mining on sample visualized


As seen from the figure, positive mining can also substantially reduce computation
time, because the input data features are greatly reduced.

This method outperformed results on a dataset without any such Positive Mining, 
especially when training on a smaller number of epochs. There was a limited but observ- 
able impact on over-fitting when training on a cross-validation training schema. This 
might be due to the fact there is less irrelevant information that may cause over-fitting 
due to the positive mining only focusing on more relevant features. 


2.7 Training


Different training schemas and pipelines were used. Training on the local
training set was done with 5-fold cross-validation. For the final submission to
the platform, the network was trained on the entire dataset.
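
A 5-fold split of the training cases can be sketched with the standard library alone. The
shuffling seed and the round-robin fold assignment are assumptions; the function name
`k_fold_splits` is hypothetical.

```python
import random

def k_fold_splits(case_ids, k=5, seed=0):
    """Yield (train_ids, val_ids) pairs for k-fold cross-validation."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)        # fixed seed for reproducible folds
    folds = [ids[i::k] for i in range(k)]   # round-robin assignment to folds
    for i in range(k):
        val = folds[i]
        train = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield train, val

splits = list(k_fold_splits(range(20)))
```

Each case appears in exactly one validation fold, so every case is used once for validation
and four times for training.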

The optimizer used was Ranger. A batch size of 4 was selected, and training was run in
parallel on 4 GPUs. A global seed value was set, and the training pipeline
consisted of 50 epochs due to resource and time constraints, as the dataset is very
computationally demanding.

Ablation results on the cross-validation set did not include the Hausdorff distance due
to significant performance slowdowns. The ablation results table is shown below:


Table 1. Cross validation results with different network architectures


Models                                        | Dice ET | Dice TC | Dice WT
Normal U-Net                                  | 73.27   | 79.85   | 76.10
Attention U-Net with CBAM                     | 75.888  | 81.07   | 76.27
Attention U-Net with CBAM and Positive Mining | 76.58   | 82.89   | 77.56


3 Results 


3.1 Validation and Final Test Phase Results 


The validation phase used a sample set of data that BraTS2021 released on their online
platform to serve as validation for a model. The test phase was a single
submission to test final model performance, for which the same architecture as in the
final validation submission was used. Submissions covered varying pipelines built on the
main Attention U-Net network architecture. The results for the validation and test phases
are shown in the tables below as mean values of Dice and Hausdorff distance (denoted H95):


Table 2. Validation results from BraTS2021


Models                                    | Dice ET | Dice TC | Dice WT | H95 ET | H95 TC | H95 WT
5-Fold Val. Training                      | 0.753   | 0.808   | 0.899   | 21.8   | 12.5   | 6.45
Total Dataset Training                    | 0.757   | 0.834   | 0.841   | 28.4   | 11.3   | 15.9
Total Dataset + Positive Mining           | 0.808   | 0.868   | 0.901   | 16.6   | 9.76   | 4.95
Total Dataset + Positive Mining +         | 0.817   | 0.86    | 0.908   | 14.5   | 9.9    | 5.97
Test Time Augmentation


Table 3. Test results from BraTS2021


Model       | Dice ET | Dice WT | Dice TC | H95 ET | H95 WT | H95 TC
Final Model | 0.858   | 0.913   | 0.869   | 12.7   | 5.43   | 16.9


For the validation phase, the 5-fold validation model served as an initial verification
submission. It was followed by a network trained on the entire dataset, and then by the
same network trained with positive mining at a probability of 10%. Lastly, the final
submission included an additional Test Time Augmentation, which meant the
input images were also augmented as a pre-processing step before the validation
data was passed to the model (Fig. 4).
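
One common form of test-time augmentation is sketched below: predictions are averaged over
flipped copies of the input, with each prediction flipped back before averaging. The text does
not say which augmentations were used at test time, so the use of flips is an assumption, and
`model` stands for any callable that maps a (C, H, W, L) array to a (K, H, W, L) array.

```python
import numpy as np
from itertools import combinations

def predict_with_flip_tta(model, image):
    """Average predictions over all flips of the three spatial axes."""
    spatial = (1, 2, 3)
    flip_sets = [axes for r in range(4) for axes in combinations(spatial, r)]
    total = 0.0
    for axes in flip_sets:                  # 8 combinations, including no flip
        flipped = np.flip(image, axis=axes) if axes else image
        pred = model(flipped)
        # Undo the flip so all predictions are aligned before averaging.
        total = total + (np.flip(pred, axis=axes) if axes else pred)
    return total / len(flip_sets)
```

For a model that is exactly flip-equivariant the average equals the plain prediction; for a
trained network the averaging tends to smooth out orientation-dependent errors.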

Overall, the most competitive network was the Attention U-Net with CBAM module, 
with a positive mining of 10% and Test Time Augmentation included. It showed strong 
improvements across most metrics for each tumour label. A sample of a prediction vs 
label is shown below: 


Fig. 4. Left: Label Prediction, Right: Ground Truth label


4 Discussion 


An Attention U-Net paired with positive mining, trained on the BraTS2021 training
dataset with a Dice loss function, achieves competitive results in accurately
segmenting brain tumors. Notable performance gains came from the attention
mechanisms and positive mining. Positive mining augmentation has shown promising
results in reducing computational load and boosting accuracy. Image augmentation
methods are now very commonly applied in segmentation tasks and have shown great
success. Such augmentations [14] can help a model regularize more effectively,
as shown by the competitive results on the unseen validation dataset (Fig. 5).


[Figure: example augmentations; panel titles: Spike Artefact, Random Swap, Biasfield
Artefact.]

Fig. 5. Augmentation on medical segmentation tasks [15]


The parameters and pipeline chosen for the network architecture could still be fur- 
ther optimized and explored. As this paper focuses on the attention and positive min- 
ing, it did not explore in depth different combinations of loss functions, optimization 
methodologies and varying depth of U-Net encoder and decoders. 

The method of positive mining could also be further explored by merging
it with concepts such as hard-sample mining, which picks out training samples
that a given neural network performs poorly on and retrains on them. These can be further
combined into the training scheme. Sample-mining augmentations have not been explored
heavily in research; they could also be developed into a training pipeline in which
successively harder or positively mined samples are cycled through the model's training.

Additionally, this paper’s method left the original labels unchanged when applying
positive mining. However, during testing, scaling the label together with the image
showed minor improvements. This is an avenue that can be further explored, as there are
known issues with inter-observer bias, especially in the field of medical segmentation,
where different medical specialists [12] segment out positive labels by hand. This can
give rise to high variance and subjectivity in the labels. There have been recent efforts
[13] to decrease label bias by using deep learning to regularize both existing and
newly generated segmentation labels.

The major bottleneck while training on 4-modal MR images was the very high
computational demand, due to the 3-D nature of the data combined with a
4-dimensional channel and one of the largest brain tumor datasets. Ensembles were not
deployed for similar reasons: the high computational demands and the short timeline
for validation and training submissions.


Abstract. Multi-modality brain tumor segmentation, which aims to predict the regions
of necrosis, edema and tumor core on multi-modality magnetic resonance images (MRIs),
is vital for the treatment of gliomas. However, it is a challenging task due to the
complex appearance and diverse shapes of tumors. Considering that multi-modality MRIs
contain rich biological properties of the tumors, we propose a novel multi-modality
tumor segmentation network that segments the brain tumor by fusing the complementary
information and the global semantic dependency information of the multi-modality
imaging data. Specifically, we propose a hierarchical modality interaction block to
build the internal relationship between complementary modality pairs, and then enhance
the complementary information between them by using channel and spatial co-attention.
To capture the long-range dependency relationships of cross-modality information, we
propose a global modality interaction Transformer block to build the global semantic
interaction between the multi-modality local features. The global modality interaction
Transformer block makes up for the CNN’s poor perception of global semantic dependency
information across modalities. We evaluate our method on the validation set of the
multi-modality brain tumor segmentation challenge 2021 (BraTs2021). The proposed
multi-modality brain tumor segmentation network achieves Dice scores of 0.8518, 0.8808
and 0.926 for the ET, TC and WT, respectively.


Keywords: Brain tumor segmentation - Transformer - Cross-modality 
information 


1 Introduction 


Gliomas are the most common intracranial malignant brain tumors. They arise from
the neuroepithelial tissue and account for about 40%-50% of central nervous
system tumors. Glioma is a malignant disease that threatens human health, with a high
recurrence rate and a high mortality rate. Surgical resection is the main treatment for
glioma; the principle is to remove as much of the tumor as possible while preserving
nerve function. Accurately and automatically predicting the tumor regions in medical
images plays a key role in the diagnosis and treatment of gliomas. It can help clinicians
speed up the identification of tumor regions and improve the efficiency of preoperative
planning. However, automatically identifying and segmenting brain tumor regions is a
challenging task. For example, the shapes and appearances of gliomas vary widely, and
there is no obvious boundary between tumor and brain tissue. It is difficult for a
segmentation model to determine the accurate and complete silhouette of the tumor from a
medical image in which the distinction between lesions and healthy tissue is unclear.

Multi-modality magnetic resonance imaging (MRI) can provide complementary
information for highlighting the lesion regions and brain tissues, and is widely used for
the diagnosis and research of brain tumors. The multi-modality MRI sequences include
four modalities [14], i.e., T1-weighted (T1), T1 contrast-enhanced (T1c), T2-weighted
(T2), and T2 Fluid Attenuation Inversion Recovery (FLAIR). The T1 and T1c modalities
are usually considered good sources for visualizing the anatomical structure and the
necrotic (enhancing tumor) region, while the T2 and FLAIR modalities highlight the lesion
and peritumoral edema regions [13,14]. For the multi-modality brain tumor segmentation
task, i.e., BraTs2021 [1-4, 14, 18], the segmentation model aims to predict the
sub-regions of the brain tumor, including Whole Tumor (WT), Tumor Core (TC), and
Enhancing Tumor (ET), from the multi-modality sequences (T1, T1c, T2 and FLAIR).
The complementary information across modalities not only enhances the visual differences
between the lesions and healthy tissue regions but also plays an important role in
guiding the segmentation model to identify each region of the brain tumor.

Recently, convolutional neural network (CNN) based brain tumor segmentation
methods [10-12,20] have achieved success in the BraTs challenges. Specifically,
U-shaped network architectures [15,19,21], i.e., encoder-decoder architectures
with skip connections, are mainly used for improving the performance of brain
tumor segmentation. The skip connections fuse the features between the encoding and
decoding pathways to recover the spatial information lost through down-sampling.
However, conventional CNN based methods simply assign different modalities to different
channels. Because there is no information interaction mechanism between the channels,
the rich cross-modality information has not been fully explored.

To make full use of cross-modality information, in this work we propose a network that
combines the Transformer [8] and nnUnet [11] for multi-modality brain tumor
segmentation. Specifically, we establish a designed complementary relationship between
multimodal MRIs according to the properties of each modality. The important information
of brain tumor sub-regions can be reasonably enhanced by using the channel-wise
and spatial-wise co-attention [16] between the complementary modality pairs. To further
improve the performance of multi-modality brain tumor segmentation, we introduce
the Transformer [8] into our network to learn the global semantic dependency information
across modalities. The Transformer [8] compares the semantics of each local feature
with other local features from different modalities, which can capture not only the
local dependencies between adjacent semantic features but also the global dependencies
between remote cross-modality features. This global dependence helps improve
the performance of brain tumor segmentation by integrating a wider range of
cross-modality context information. Figure 1 shows the overall architecture of our
proposed brain tumor segmentation network.


Fig. 1. Overview of the proposed network architecture. The network is a U-net based
architecture [11]. The hierarchical modality interaction co-attention block captures the
complementary information of different modalities, the global modality interaction
transformer block captures the cross-modality global semantic dependency information, and
the skip connections fuse the multi-scale complementary information and cross-modality
global semantic dependency information for brain tumor segmentation.


2 Method 


2.1 Overall Network Structure 


We employ the U-net shaped 3D encoder-decoder architecture [11] with skip connections
as the backbone to extract the features of each modality and predict the segmentation
of the brain tumor. The four MRI modalities of each patient, each of size 240 x 240 x 155,
are concatenated into a four-channel tensor in the order T1, T1c, T2 and
FLAIR, which yields a network input of size 4 x 240 x 240 x 155.
The output feature maps of each encoding block are divided equally into four sections
along the channel dimension to represent the features of the four modalities. The
hierarchical modality interaction co-attention block takes the multi-modality features
as input and enhances the complementary information between the modality pairs by using
spatial and channel co-attention (SCCA). At the end of the last encoder block, the
multi-modality features are fed to the proposed global modality interaction transformer
block to learn the global semantic dependency information between the multi-modality
images. The decoder blocks use the skip connections to fuse the multi-scale
cross-modality complementary information and the global semantic dependency information
for segmenting the brain tumor sub-regions in the multi-modality MRIs. The brain tumor
segmentation prediction has three channels, i.e., 3 x 240 x 240 x 155, where each channel
represents one sub-region of the tumor: ET, TC and WT, respectively.
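
The channel split described above can be sketched in NumPy. The mapping of the four equal
channel sections to modality names follows the input order T1, T1c, T2, FLAIR; the function
name `split_by_modality` is hypothetical.

```python
import numpy as np

MODALITIES = ("T1", "T1c", "T2", "FLAIR")

def split_by_modality(features):
    """Split a (C, H, W, D) feature map into four (C/4, H, W, D) sections."""
    if features.shape[0] % 4 != 0:
        raise ValueError("channel count must be divisible by 4")
    return dict(zip(MODALITIES, np.split(features, 4, axis=0)))

features = np.arange(32 * 2 * 3 * 4, dtype=float).reshape(32, 2, 3, 4)
sections = split_by_modality(features)
```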


2.2 Hierarchical Modality Interaction Co-attention


Fig. 2. Illustration of the hierarchical modality interaction co-attention block.


Multi-modality MRIs provide rich biological properties of the sub-regions of the brain
tumor. We propose a hierarchical modality interaction co-attention block to capture
the cross-modality complementary information, which could improve the sensitivity of
the feature extractors to brain tumor sub-region information. To achieve
this goal, we design a cross-modality interaction strategy to guide the spatial and
channel co-attention (SCCA) to capture the complementary relationships between modality
pairs. The SCCA re-calibrates channel-wise feature responses and highlights the
features of common interest between complementary modality pairs. Starting from the
features of T1 and T2, the hierarchical modality interaction co-attention block is
progressively employed to fuse the features of the multiple modalities.

Figure 2 illustrates the architecture of the hierarchical modality interaction
co-attention block. The block divides the output feature map $f_{E_i}$ with size
C x H x W x D of the encoder block $E_i$ into four sections with size C/4 x H x W x D
along the channel dimension, i.e., $f_{T1}$, $f_{T1c}$, $f_{T2}$ and $f_{FLAIR}$, each of
which represents the local feature map of the corresponding modality. Then, the important
information in the local detail feature maps $f_{T1}$, $f_{T1c}$, $f_{T2}$ and
$f_{FLAIR}$ is enhanced by the following strategies:


1) The T1 feature map $f_{T1}$ is enhanced by the T2 modality $f_{T2}$, i.e.,
   $\hat{f}_{T1} = SCCA(f_{T1}; f_{T2})$. The T2 modality reflects the lesion region of
   the tumor more significantly than the T1 modality, while the T1 modality contains rich
   information about the healthy tissues. This strategy encourages the feature extractor
   to learn the discriminative information of the lesion and healthy brain tissue regions.

2) We use the feature map of the T1 modality $f_{T1}$ to restrain the healthy tissue
   features in the T2 modality, i.e., $\hat{f}_{T2} = SCCA(f_{T2}; f_{T1})$, which
   encourages the feature extractor to learn the information of the whole lesion region,
   including the necrotic, tumor core and edema regions, in the T2 modality.

3) For the FLAIR modality, we use the enhanced feature map of the T2 modality
   $\hat{f}_{T2}$ to reinforce the information of the edema region in the feature map
   $f_{FLAIR}$, i.e., $\hat{f}_{FLAIR} = SCCA(f_{FLAIR}; \hat{f}_{T2})$.

4) For the T1c modality $f_{T1c}$, we use the enhanced feature map of the T1 modality
   $\hat{f}_{T1}$ to reinforce the information of the necrotic region, i.e.,
   $\hat{f}_{T1c} = SCCA(f_{T1c}; \hat{f}_{T1})$.


Hierarchical and Global Modality Interaction for Brain Tumor Segmentation 445


$SCCA(f_a; f_b)$ refers to the spatial and channel co-attention operation, which
is applied to the features of a complementary modality pair $(f_a, f_b)$. In this work,
$SCCA(f_a; f_b)$ enhances the modality feature $f_a$ by using the channel-wise and
spatial-wise attention of the modality feature $f_b$. The channel co-attention
$CCA(f_a; f_b)$ can be formulated as:


$$CCA(f_a; f_b) = \sigma\big(W^{C \times \frac{C}{2}}\, \delta(W^{\frac{C}{2} \times C}\, \mathrm{AvgPool}(f_b))\big) \odot f_a,$$ (1)


where $\odot$ is the element-wise multiplication, $\mathrm{AvgPool}(\cdot)$ refers to the
3D average pooling operation for the 4D tensor $f_b \in \mathbb{R}^{C \times H \times W \times D}$,
and $W^{C \times \frac{C}{2}}$ and $W^{\frac{C}{2} \times C}$ represent the parameters of
two fully connected layers. $\delta(\cdot)$ and $\sigma(\cdot)$ refer to the ReLU and
Sigmoid activations, respectively. The spatial co-attention $SCA(f_a; f_b)$ can be defined as:


$$SCA(f_a; f_b) = \sigma(W^{1 \times 1 \times 1} f_b) \odot f_a,$$ (2)


where $W^{1 \times 1 \times 1}$ refers to a convolutional layer with a kernel size of
1 x 1 x 1. The channel and spatial co-attention $SCCA(f_a; f_b)$ is defined as:


$$SCCA(f_a; f_b) = \delta\big(W^{3 \times 3 \times 3}(CCA(f_a; f_b) + SCA(f_a; f_b))\big).$$ (3)
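
Equations (1)-(3) can be sketched in NumPy as follows. This is an illustration under stated
assumptions, not the authors' code: biases are omitted, the 1 x 1 x 1 convolution maps the C
channels of $f_b$ to a single attention channel, and the 3 x 3 x 3 convolution is a naive
zero-padded implementation. All function and parameter names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def cca(fa, fb, w1, w2):
    """Eq. (1): channel weights computed from f_b, applied to f_a.

    fa, fb: (C, H, W, D); w1: (C/2, C); w2: (C, C/2).
    """
    pooled = fb.mean(axis=(1, 2, 3))                # AvgPool -> (C,)
    weights = sigmoid(w2 @ relu(w1 @ pooled))       # (C,)
    return weights[:, None, None, None] * fa

def sca(fa, fb, w):
    """Eq. (2): spatial map from a 1 x 1 x 1 convolution of f_b, applied to f_a.

    w: (C,) weights producing one attention channel (an assumption).
    """
    return sigmoid(np.tensordot(w, fb, axes=1))[None] * fa

def conv3x3x3(x, kernel):
    """Naive zero-padded 3 x 3 x 3 convolution; kernel: (C_out, C_in, 3, 3, 3)."""
    _, h, w, d = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], h, w, d))
    for i in range(3):
        for j in range(3):
            for k in range(3):
                window = padded[:, i:i + h, j:j + w, k:k + d]
                out += np.einsum("oc,chwd->ohwd", kernel[:, :, i, j, k], window)
    return out

def scca(fa, fb, w1, w2, w_sp, kernel):
    """Eq. (3): ReLU(Conv3x3x3(CCA + SCA))."""
    return relu(conv3x3x3(cca(fa, fb, w1, w2) + sca(fa, fb, w_sp), kernel))
```

The first argument is the feature being enhanced and the second is the feature that supplies
the attention, so `scca(f_t1, f_t2, ...)` corresponds to strategy 1) above.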


2.3 Global Modality Interaction Transformer 


We present a global modality interaction transformer block to capture the
cross-modality global semantic dependency information for brain tumor segmentation. The
Transformer is good at learning explicit global and long-range semantic dependencies
[6,7] from an input sequence. Therefore, we employ the Transformer to build
the global dependency of local semantic information in each modality, which can help
the brain tumor segmentation model extract more powerful cross-modality features.
Figure 3 shows the architecture of the proposed global modality interaction transformer
block. The Transformer block consists of L Transformer layers (L = 6), each of which
starts with a multi-head attention (MHA) layer that builds the global dependencies between
the local features of the multiple modalities, followed by a feed-forward layer (FFL)
that enhances them.


Fig. 3. Illustration of the global modality interaction transformer block. “MHA”, “FFL”
and “LN” refer to the Multi-head attention layer, the feed-forward layer and the
normalization layer, respectively.


Given the output feature maps of the last encoder block for the four modalities,
$f_{T1}$, $f_{T1c}$, $f_{T2}$ and $f_{FLAIR}$, the Transformer takes a sequence as input.
We flatten the spatial dimensions of each feature map into one dimension, i.e.,
$S_{T1} \in \mathbb{R}^{\frac{C}{4} \times N}$, $S_{T1c} \in \mathbb{R}^{\frac{C}{4} \times N}$,
$S_{T2} \in \mathbb{R}^{\frac{C}{4} \times N}$ and $S_{FLAIR} \in \mathbb{R}^{\frac{C}{4} \times N}$,
where $N = \frac{H}{16} \times \frac{W}{16} \times \frac{D}{16}$ and C = 320. Then, we
concatenate the modality features along the dimension N to form a feature sequence S with
size $\frac{C}{4} \times 4N$. The input of the Transformer based fusion block can be
formulated as $S = \{S_{T1}, S_{T1c}, S_{T2}, S_{FLAIR}\}$.
The local features in each modality are treated as tokens and fed into the Transformer
block to learn global semantic interaction information. We also introduce learnable
position embeddings [6,7,17] $P_e \in \mathbb{R}^{\frac{C}{4} \times 4N}$ and fuse them
with the feature sequence to encode the location information of each local detail feature
for brain tumor segmentation:


$$z_0 = W \times S + P_e,$$ (4)


where W is the linear projection operation and $z_0$ is the input feature embedding for
the first Transformer layer. The Transformer layer in this work has a standard architecture
as in previous works [6,7,17], which consists of a multi-head attention (MHA) layer,
a feed-forward layer (FFL) and two normalization layers (LN). The output of the $l$-th
($l = 1, ..., 6$) Transformer layer $z_l$ can be calculated by:


$$z_l = \mathrm{FFL}(\mathrm{LN}(z'_l)) + z'_l, \qquad z'_l = \mathrm{MHA}(\mathrm{LN}(z_{l-1})) + z_{l-1}.$$ (5)


The output of the Transformer block is divided along dimension N into four parts, and
then a reshape and a channel-wise concatenation aggregate them into a 4D feature map
with size $C \times \frac{H}{16} \times \frac{W}{16} \times \frac{D}{16}$ to facilitate
subsequent decoding operations.
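
The conversion between the 4D feature map and the token sequence, and back, can be sketched
as below. It covers only the reshaping around the Transformer, not the Transformer layers
themselves; the function names are hypothetical.

```python
import numpy as np

def to_sequence(features):
    """(C, h, w, d) -> (C/4, 4N): flatten each modality and concatenate along N."""
    parts = np.split(features, 4, axis=0)                  # four (C/4, h, w, d)
    tokens = [p.reshape(p.shape[0], -1) for p in parts]    # four (C/4, N)
    return np.concatenate(tokens, axis=1)                  # (C/4, 4N)

def from_sequence(seq, spatial_shape):
    """(C/4, 4N) -> (C, h, w, d): split along N, reshape, concatenate channels."""
    parts = np.split(seq, 4, axis=1)                       # four (C/4, N)
    maps = [p.reshape((p.shape[0],) + tuple(spatial_shape)) for p in parts]
    return np.concatenate(maps, axis=0)

features = np.arange(320 * 2 * 3 * 4, dtype=float).reshape(320, 2, 3, 4)
seq = to_sequence(features)                 # shape (80, 96)
restored = from_sequence(seq, (2, 3, 4))    # shape (320, 2, 3, 4)
```

With C = 320 each token has 80 features, and the sequence length is four times the number of
spatial positions, so attention runs across positions of all four modalities at once.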


2.4 Network Encoder Pathway 


We follow the nnUnet architecture [11] to build the encoder network. It consists of five
encoder blocks, each of which contains a down-sampling layer (convolutional kernel
size 3 x 3 x 3, stride = (2, 2, 2)) and a 3D convolutional layer (convolutional kernel
size 3 x 3 x 3, stride = (1, 1, 1)). The output of each encoder block is fed to a
hierarchical modality interaction co-attention block to capture the complementary
information between the modalities. At the end of the last cross-modality internal
relationship block, the output is fed to the proposed transformer based fusion block to
capture the cross-modality global semantic interaction information.


2.5 Network Decoder Pathway 


The shallow features in the encoder pathway contain rich detail information about the
tumor sub-regions, which is important for predicting a refined segmentation result for
brain tumors. In this work, we use the skip connections to fuse the multi-scale
cross-modality complementary information and recover the detail information lost through
down-sampling. We also integrate the cross-modal semantic dependency information
into each decoder block to make full use of cross-modal information for improving the
performance of the brain tumor segmentation network.

The decoder network consists of four decoder blocks, each of which has the
same network architecture as the decoder block in nnUnet [11], i.e., a 3D
deconvolutional layer (kernel size 3 x 3 x 3, stride = (2, 2, 2)) for up-sampling and a 3D
deconvolutional layer (kernel size 3 x 3 x 3, stride = (1, 1, 1)) for feature recovery.
For each decoder block, the skip connections fuse the multi-scale complementary
information, the global semantic dependency information, and the up-sampled features.
The multi-scale complementary information and the global semantic dependency
information are resampled to the same size and concatenated with the up-sampled
features from the previous decoder block along the channel dimension. Finally, the decoder
network outputs the segmentation for the sub-regions ET, TC and WT of brain tumors.
Because the segmentation result for ET can contain some noise,
we employ connected component-based post-processing [5,9] to remove the noise
regions from the segmentation results.
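
Connected component-based clean-up of a binary ET mask can be sketched as follows. The
6-connectivity and the size threshold `min_voxels` are assumptions, since the text does not
give the criterion used to decide which regions count as noise; the function name is
hypothetical.

```python
import numpy as np
from collections import deque

def remove_small_components(mask, min_voxels):
    """Drop 6-connected components with fewer than min_voxels voxels."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros_like(mask)
    cleaned = np.zeros_like(mask)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for start in zip(*np.nonzero(mask)):
        if visited[start]:
            continue
        # Breadth-first search collects one connected component.
        visited[start] = True
        queue, component = deque([start]), [start]
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in offsets:
                n = (z + dz, y + dy, x + dx)
                in_bounds = all(0 <= n[a] < mask.shape[a] for a in range(3))
                if in_bounds and mask[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
                    component.append(n)
        if len(component) >= min_voxels:    # keep only sufficiently large regions
            for voxel in component:
                cleaned[voxel] = True
    return cleaned
```

This pure-Python search is slow on full-size volumes; it is meant to show the logic rather
than to be used as is.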


2.6 Training 


The proposed methods were implemented in PyTorch on a PH402 SKU 200 GPU
with 32 GB memory. We employ the cross-entropy loss function to train our proposed
network on the training data set of BraTs2021 [1-4, 14]. Each sample in the training data
is center-cropped to size 192 x 160 x 108. This ensures that the useful information
of each sample is kept within the cropping boundary while minimizing the content- 
free areas of the sample. We used Adam to optimize the entire network parameters 
from scratch with the initial learning rate 1 x 107° and the batch size is 1. The training 
process took 1000 epochs, and the learning rate decreases according to the “poly”
learning rate strategy [11]: $(1 - epoch/epoch_{max})^{0.9}$.
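
The schedule can be written as a one-line function. The exponent 0.9 is the value used by
nnU-Net's poly schedule and is assumed here; `base_lr=1e-3` is only a placeholder, because the
initial learning rate is not legible in the source.

```python
def poly_lr(epoch, max_epochs=1000, base_lr=1e-3, exponent=0.9):
    """Poly schedule: base_lr * (1 - epoch / max_epochs) ** exponent."""
    return base_lr * (1 - epoch / max_epochs) ** exponent

schedule = [poly_lr(e) for e in range(0, 1001, 250)]
```

The rate starts at `base_lr`, decays almost linearly for most of training, and reaches zero at
the final epoch.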


3 Results 


The proposed multi-modality brain tumor segmentation network is evaluated on the
validation set of BraTs 2021. The segmentation results of the proposed network are reported
in Table 1. The proposed method achieved mean Dice scores of 0.8518, 0.8808 and 0.926 for
ET, TC and WT on the validation set, respectively. The Hausdorff95, sensitivity
and specificity are also reported in Table 1. We also show a quantitative comparison of
the proposed work with the baseline work [11] in Table 2. Our proposed network achieves
better brain tumor segmentation results in Dice scores for each class than the baseline
work. This experiment demonstrates that the proposed fusion of cross-modality detail
interaction information and cross-modality global semantic interaction information can
effectively improve the performance of multi-modality brain tumor segmentation. Fig. 4
shows the qualitative results of our method on the validation data set of BraTs2021. In
general, the qualitative and quantitative results demonstrate the effectiveness of the
proposed method.


[Figure: segmentation overlays for two validation cases. BraTS21_Validation_174: Dice
scores of [illegible], 0.982, 0.986 for ET, TC and WT. BraTS21_Validation_190: Dice scores
of 0.981, 0.993, 0.990 for ET, TC and WT.]

Fig. 4. Qualitative results on BraTs2021. The enhancing tumor (ET) is shown in red, tumor
core (TC) in blue and edema (WT) in yellow. (Color figure online)


Table 1. Segmentation results of ET, TC and WT on BraTs 2021 Challenge Validation set in
terms of the Dice score, Hausdorff95, Sensitivity and Specificity. All scores are evaluated online.


Metrics          | ET     | TC     | WT
Dice (%)         | 85.178 | 88.079 | 92.605
Hausdorff95 (mm) | 6.034  | 7.397  | 3.653
Sensitivity (%)  | 84.374 | 85.932 | 92.821
Specificity (%)  | 99.982 | 99.982 | 99.936


Table 2. Comparison results of our method and Baseline [11] on the BraTs 2021 validation set
in terms of Dice (%). All scores are evaluated online.


Method        | ET     | TC     | WT
Baseline [11] | 79.293 | 87.239 | 92.388
Ours          | 85.178 | 88.079 | 92.605


4 Conclusion 


In this work, we proposed a hierarchical and global modality interaction network for 
multi-modality brain tumor segmentation. At each scale of the encoder block, the local 
features of complementary modality pairs interact hierarchically to capture the 
cross-modality complementary information through channel-wise and spatial-wise 
co-attention. We also proposed a global modality interaction Transformer block to extract 
the global cross-modality semantic dependency information. The proposed brain tumor 
segmentation network has been evaluated on the validation set of the BraTS 2021 Challenge 
and achieved high Dice scores of 0.8518, 0.8808, and 0.926 for the tumor sub-regions 
ET, TC, and WT, respectively. 
Abstract. Brain tumor segmentation remains an open and popular 
challenge, for which numerous medical image segmentation models have 
been proposed. Using the platform provided by the BraTS 2021 challenge, 
we implemented a battery of cutting-edge deep neural networks, such as 
nnU-Net, UNet++, CoTr, HRNet, and Swin-Unet, to directly compare the 
performance of distinct models. To improve segmentation accuracy, we 
first tried several modification techniques (e.g., data augmentation, 
region-based training, and a batch-Dice loss function). Next, the outputs 
of the five best models were averaged in a final ensemble model, with 
four distinct architectures represented in the committee. As a result, 
the strengths of the individual models were combined by the aggregation. 
Our model ranked among the best-performing entries in the Brain Tumor 
Segmentation (BraTS) 2021 competition, which drew over 1200 researchers 
from all over the world, and achieved Dice scores of 0.9256, 0.8774, 0.8576 
and Hausdorff distances (95%) of 4.36, 14.80, 14.49 for whole tumor, 
tumor core, and enhancing tumor, respectively. 


Keywords: Brain tumor segmentation · Ensemble learning · 
nnU-Net · UNet++ · CoTr · HRNet 


1 Introduction 


Glioma is one of the most aggressive and fatal brain tumors. Precise segmentation 
of glioma in medical images plays a crucial role in treatment planning, 
computer-aided surgery, and health monitoring. However, the ambiguous 
boundaries of tumors and their variations in shape, size, and position make 
it difficult to distinguish them from healthy brain tissue. Accurate, automatic 
segmentation of glioma tissue is especially challenging for traditional 
medical image analysis. 

The BraTS Challenge provides a platform that enables researchers to fairly 
evaluate their state-of-the-art algorithms for segmenting brain glioma. The BraTS 
challenge has been running since 2012, attracting top research teams from around 
the world each year. In 2021, the challenge was jointly organized by the 
Radiological Society of North America (RSNA), the American Society of Neuroradiology 
(ASNR), and the Medical Image Computing and Computer Assisted Interventions 
(MICCAI) society. Around 1200 researchers participated in the challenge. 
The number of cases collected by the BraTS committee rose sharply, from 
660 last year to 2000 in 2021 [15]. The dataset consists of 1251 training cases 
and 219 validation cases, while the test data are not open to the public [1-5]. 
The multi-parametric magnetic resonance imaging (mpMRI) data include four 
modalities for every case: native T1-weighted (T1), post-contrast T1-weighted 
(T1Gd), T2-weighted (T2), and T2-weighted Fluid Attenuated Inversion Recovery 
(T2-FLAIR) images [13]. BraTS evaluates the segmentation of brain glioma 
sub-regions, including the enhancing tumor (ET), the tumor core (TC), and the 
whole tumor (WT) [14]. 

Owing to the rapid development of deep learning, various newly developed deep 
neural networks have outperformed traditional algorithms. U-Net, a state-of-the-art 
medical image segmentation method, was first introduced in 2015 [6]. 
This encoder-decoder network with skip-connections achieved advanced 
performance, and numerous algorithms have since been developed 
using U-Net as the backbone. A self-adapting U-Net-like neural network called 
nnU-Net (no new U-Net) automatically configures multiple steps, including 
preprocessing, network architecture, and post-processing, with little manual 
intervention [7]. Another recently proposed U-shaped transformer 
network, Swin-Unet, has demonstrated strong performance on 
multi-organ and cardiac segmentation challenges [11]. In addition to the 
performance of these individual models, ensemble learning, which aggregates two 
or more models, can achieve better and more generalizable results. The most 
popular ensemble methods include ensemble mean, ensemble vote, ensemble 
boosting, and ensemble stacking. Ensemble mean averages the predictions of 
multiple models. Ensemble vote counts the votes of the models and accepts the 
majority, which can lower the variance of the results. Ensemble boosting trains 
models on the mistakes of previous models, and ensemble stacking uses a model to 
combine predictions from different types of models. 
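
To make the difference between ensemble mean and ensemble vote concrete, the following sketch applies both to per-voxel class probabilities. It is an illustration under stated assumptions, not the authors' implementation: the function names and the array layout `(n_models, n_classes, *spatial)` are choices made here.

```python
import numpy as np


def ensemble_mean(probs: np.ndarray) -> np.ndarray:
    """Average class probabilities over models, then pick the most probable class."""
    return probs.mean(axis=0).argmax(axis=0)


def ensemble_vote(probs: np.ndarray) -> np.ndarray:
    """Let each model vote for its own argmax class; the majority wins per voxel."""
    n_classes = probs.shape[1]
    votes = probs.argmax(axis=1)  # shape: (n_models, *spatial)
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)  # ties resolve to the lowest class index


# Three models, two classes, a single voxel: one confident model against two hesitant ones.
probs = np.array([[[0.1], [0.9]],
                  [[0.6], [0.4]],
                  [[0.6], [0.4]]])
mean_label = ensemble_mean(probs)
vote_label = ensemble_vote(probs)
```

In this toy case the two rules disagree: averaging lets the single confident model outweigh the two hesitant ones, whereas voting ignores confidence altogether.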

In this study, we implemented multiple different models and applied 
ensemble learning to combine them. We used two metrics, the Dice similarity 
coefficient (Dice) and the Hausdorff Distance (HD), to evaluate model performance. 
Dice ranges from 0 to 1 and indicates the similarity between the prediction and the 
ground truth, while HD measures the largest segmentation error. To improve model 
accuracy, we added several modifications. The final ensemble model, built from 
the selected top-performing models, gave the best overall result. In 
Sect. 2, we briefly introduce the main model architectures applied to the 
BraTS dataset, followed by further implementation details. 
In Sect. 3, we describe the performance of the individual models and the 
ensemble model. Lastly, Sect. 4 discusses the findings of the current research 
and proposes potential improvements for future studies. 


2 Methods 


2.1 Ensemble Learning 


Ensembling is one of the most popular fusion methods. It combines the advantages 
of various models, improves overall predictive performance, 
and increases robustness and generalization. Ensemble mean averages the 
unweighted outputs of multiple models, whereas ensemble vote takes the 
unweighted majority vote of the models. The ensemble mean model 
showed convincing performance compared with most individual models as well as 
with the ensemble vote method. The individual models attempted in the current 
study are explained as follows: 


NnU-Net. F. Isensee et al. proposed a powerful automatic biomedical image 
segmentation method named nnU-Net (no new net), which can be trained out-of-the-box 
to segment diverse 3D medical datasets and requires no manual intervention 
or expert knowledge. nnU-Net has achieved leading results on a broad variety of 
datasets in many international biomedical image segmentation competitions [8]. 
Given this success, we applied nnU-Net to the BraTS 2021 dataset; the baseline 
model without any modifications already achieved impressive performance on this 
dataset. nnU-Net is well known for its U-Net-like architecture, a symmetric 
encoder-decoder structure with skip-connections. The encoder performs 
downsampling, and the decoder upsamples the salient features passed from the 
bottleneck. Both the encoder and the decoder have five convolutional layers and 
are connected by a bottleneck block. 

Besides the architecture itself, the choice of hyper-parameters is another key 
determinant of overall model performance. Data are normalized before being 
fed into the first layer. The input patch size is 128 × 128 × 128 with a 
batch size of 2, and Leaky ReLU is used as the nonlinearity. 
Skip-connections pass high-resolution features from the encoder to reduce the 
spatial information loss caused by downsampling. At the end of the decoder, 
a 1 × 1 × 1 filter is applied so that the number of channels is 3, and the 
output is then passed to a softmax function. The loss function is the sum of the 
Dice and Cross-Entropy (CE) losses. Training runs for 1000 epochs of 250 
iterations each, with an initial learning rate of 0.001. 
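
The combined Dice and CE objective can be illustrated with a small NumPy sketch. It is a simplified stand-in, assuming softmax probabilities and one-hot labels of shape `(n_classes, *spatial)` and an unweighted sum of the two terms; the function names are hypothetical and this is not the nnU-Net code itself.

```python
import numpy as np


def soft_dice_loss(probs: np.ndarray, onehot: np.ndarray, eps: float = 1e-6) -> float:
    """One minus the soft Dice, averaged over classes."""
    axes = tuple(range(1, probs.ndim))  # sum over the spatial axes only
    inter = (probs * onehot).sum(axis=axes)
    denom = probs.sum(axis=axes) + onehot.sum(axis=axes)
    return float(1.0 - np.mean((2.0 * inter + eps) / (denom + eps)))


def cross_entropy_loss(probs: np.ndarray, onehot: np.ndarray, eps: float = 1e-12) -> float:
    """Mean voxel-wise cross-entropy between probabilities and one-hot labels."""
    return float(-(onehot * np.log(probs + eps)).sum(axis=0).mean())


def dice_ce_loss(probs: np.ndarray, onehot: np.ndarray) -> float:
    """Unweighted sum of the Dice and cross-entropy terms."""
    return soft_dice_loss(probs, onehot) + cross_entropy_loss(probs, onehot)


# Two classes on four voxels: a perfect prediction versus an uninformative one.
onehot = np.array([[1.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
uniform = np.full_like(onehot, 0.5)
perfect_loss = dice_ce_loss(onehot, onehot)
uniform_loss = dice_ce_loss(uniform, onehot)
```

The Dice term rewards region overlap while the CE term penalizes each voxel individually, so a perfect prediction drives both towards zero.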


U-Net++. U-Net++ is one of the popular variants that use U-Net as the 
backbone. The performance of U-Net is hindered by its suboptimal depth design and 
by skip-connections that fuse only feature maps of the same level. To overcome 
these shortcomings, U-Net++ embeds nested U-Nets and redesigns 
the skip-connections. With this design, pruning can remove the burden of 
unnecessary layers and parameters while maintaining the segmentation 
ability [9]. 

The input patch size is 96 × 96 × 96 with a batch size of 2, 
followed by an Instance Normalization (IN) layer. The 3D model was trained for 
320 epochs with a Dice and HD95 loss and a learning rate of 0.001. 


High-Resolution Net (HRNet). Unlike many state-of-the-art architectures, 
HRNet does not encode input images into low-resolution representations and 
then decode information from the salient features. Instead, HRNet maintains 
high-resolution representations throughout the entire process, so more 
precise semantic and spatial information is preserved in its architecture. Brain 
tumor segmentation is a position-sensitive task, and compared with other model 
structures, HRNet is better able to capture detailed positional 
information [10]. Therefore, we developed a 3D version of HRNet and applied it 
to the BraTS dataset. 

The model was trained on 128 × 128 × 128 input patches for a total 
of 320 epochs, with 250 iterations per epoch. We 
used a small batch size of 2 and an initial learning rate of 0.001, and 
the sum of the Dice and CE losses was used as the loss function. 


Swin-Unet. Motivated by the transformer’s convincing performance in the Natural 
Language Processing (NLP) domain, Swin-Unet was developed to transfer its 
strength in modeling long-range semantic information to the computer 
vision domain. Swin-Unet is the first pure transformer-based U-Net variant, 
with a symmetric transformer-based encoder and decoder that are interconnected 
by skip-connections. According to H. Cao et al. [11], Swin-Unet effectively 
solves the over-segmentation problem encountered by Convolutional Neural 
Network (CNN)-based models. 

The Swin-Unet 2D model shares similar training parameters with HRNet. 
However, its 2D performance was much worse than expected, so the 3D 
model was not completed. 


CoTr. CNNs have achieved competitive performance, but that performance is 
still inevitably hindered by their limited receptive fields. Since the Transformer 
can effectively address this issue, Xie et al. [12] proposed a novel architecture 
that combines a CNN and a Transformer. The architecture introduced in CoTr 
inherits the advantages of both. 

Patches of size 128 × 128 × 128 are fed into a three-stage algorithm in which 
each stage consists of one transformer and one convolutional layer. CoTr uses Dice 
and CE as the loss function. The model was trained for 320 epochs (250 iterations 
per epoch) with a learning rate of 0.001. 
2.2 Data Augmentation (DA) 


Limited data can seriously constrain model performance, especially on 
unseen data. Therefore, data augmentation is necessary, as it expands the 
limited dataset and exposes the models to more variation. Each sample has 
a 20% chance of being scaled, rotated, increased in contrast, or mirrored, with 
each transformation applied at random and independently of the others. 
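
The independent 20% draws can be sketched as follows. The transforms themselves are simplified stand-ins chosen for this illustration (nearest-neighbour zoom, rotation by multiples of 90 degrees, contrast stretching around the mean, axis flips), and the parameter ranges are assumptions; the source does not specify them.

```python
import numpy as np


def zoom_nearest(vol: np.ndarray, factor: float) -> np.ndarray:
    """Nearest-neighbour zoom about the volume centre; the output keeps the input shape."""
    idx = [np.clip(np.round((np.arange(n) - n / 2) / factor + n / 2).astype(int), 0, n - 1)
           for n in vol.shape]
    return vol[np.ix_(*idx)]


def augment(vol: np.ndarray, rng: np.random.Generator, p: float = 0.2) -> np.ndarray:
    """Apply each transform independently with probability p."""
    out = np.asarray(vol, dtype=np.float32)
    if rng.random() < p:  # scaling (assumed range 0.8 to 1.2)
        out = zoom_nearest(out, rng.uniform(0.8, 1.2))
    if rng.random() < p:  # rotation in the plane of the first two axes (assumes equal sizes)
        out = np.rot90(out, k=int(rng.integers(1, 4)), axes=(0, 1))
    if rng.random() < p:  # contrast increase around the mean intensity
        mean = out.mean()
        out = (out - mean) * rng.uniform(1.0, 1.5) + mean
    if rng.random() < p:  # mirroring along a randomly chosen axis
        out = np.flip(out, axis=int(rng.integers(0, out.ndim)))
    return out


rng = np.random.default_rng(0)
volume = np.arange(64, dtype=np.float32).reshape(4, 4, 4)
augmented = augment(volume, rng)
```

In a real pipeline the spatial transforms (scaling, rotation, mirroring) would have to be applied identically to the label map, which this sketch leaves out.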


2.3 Batch Normalization (BN) 


Batch normalization is believed to bring benefits such as faster convergence, 
greater robustness, better generalization, and reduced overfitting. 


2.4 Batch Dice (BD) 


The batch-wise Dice loss is computed over the whole batch rather than per sample. 
This approach avoids large targets dominating the prediction results [17]. BD 
treats the batch as a single sample and pools the computation over all samples 
in it, unlike the per-sample Dice, which treats the samples as independent. Hence, 
the model is less sensitive to imperfect predictions [18]. However, according to 
our empirical results, batch Dice actually degrades model performance; the 
implementation details are explained in Sect. 3.1. 
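
The difference between per-sample Dice and batch Dice can be shown on a toy batch. This sketch assumes binary masks of shape `(batch, *spatial)`; the function names are hypothetical and the code is not taken from the authors' training pipeline.

```python
import numpy as np


def dice_per_sample(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Dice computed separately for each sample, then averaged over the batch."""
    axes = tuple(range(1, pred.ndim))
    inter = (pred * target).sum(axis=axes)
    denom = pred.sum(axis=axes) + target.sum(axis=axes)
    return float(np.mean((2 * inter + eps) / (denom + eps)))


def dice_batch(pred: np.ndarray, target: np.ndarray, eps: float = 1e-6) -> float:
    """Batch Dice: overlaps and volumes are pooled over the whole batch first."""
    inter = (pred * target).sum()
    denom = pred.sum() + target.sum()
    return float((2 * inter + eps) / (denom + eps))


# Sample 0: a 100-voxel target segmented perfectly. Sample 1: a 2-voxel target missed entirely.
target = np.zeros((2, 200))
pred = np.zeros((2, 200))
target[0, :100] = 1
pred[0, :100] = 1
target[1, :2] = 1

per_sample_score = dice_per_sample(pred, target)
batch_score = dice_batch(pred, target)
```

Here the missed small target halves the per-sample score but barely moves the pooled score, which illustrates why pooling makes the loss less sensitive to errors on individual samples.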


2.5 Postprocessing 


The predicted enhancing tumor (ET) is sometimes too small to be taken into 
consideration. In other words, when the predicted enhancing tumor volume is 
smaller than a threshold, it is replaced with necrosis labels [18]. The 
best threshold is selected by optimizing the ET Dice. Postprocessing may 
sacrifice the ET HD95 score by a small amount, but the ET Dice can be improved by 
around 2% or more. 

The performance of three selected models before and after postprocessing 
is shown in Table 1 for comparison. The ET Dice improvement is 
attributable to postprocessing, since the WT and TC scores are unchanged. 
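
The thresholding rule can be sketched as below. The label values (4 for enhancing tumor, 1 for necrosis) follow the usual BraTS 2021 convention and are an assumption here, as is measuring the volume in voxels; the function name is hypothetical.

```python
import numpy as np

ET_LABEL = 4   # enhancing tumor (assumed BraTS 2021 label value)
NCR_LABEL = 1  # necrosis (assumed BraTS 2021 label value)


def suppress_small_et(seg: np.ndarray, threshold: int) -> np.ndarray:
    """Relabel all ET voxels as necrosis when the predicted ET volume is below the threshold."""
    out = seg.copy()
    et_mask = out == ET_LABEL
    if 0 < et_mask.sum() < threshold:
        out[et_mask] = NCR_LABEL
    return out


# A prediction with only three ET voxels falls below a threshold of 500 voxels.
seg = np.zeros((10, 10, 10), dtype=np.int64)
seg[0, 0, :3] = ET_LABEL
cleaned = suppress_small_et(seg, threshold=500)
```

Because the relabelled voxels stay inside the tumor core and whole tumor, only the ET metrics change, which matches the unchanged WT and TC columns in Table 1.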


2.6 Evaluation Metrics 


Evaluating model performance is an essential process. The following metrics are 
the main tools used to measure medical image segmentation qualities. 


Dice Similarity Coefficient (Dice). Dice quantifies how closely the prediction 
matches the ground truth: a perfect prediction yields 1, and a prediction with no 
overlap yields 0. 
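
For reference, the standard definition of the Dice coefficient is given below, where $P$ denotes the set of predicted voxels and $G$ the set of ground-truth voxels for a given sub-region (the symbols are notation introduced here, not the authors').

```latex
\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}
```

When $P = G$ the ratio equals 1, and when the two sets are disjoint it equals 0.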


Hausdorff Distance 95 (HD95). HD measures the longest distance between 
the predictions and the ground truth. HD95 takes the 95th percentile of the 
distances instead, which reduces the impact of a small number of outliers. 
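
A brute-force version of HD95 on two point sets can be sketched as follows. This is one common formulation (the larger of the two directed 95th-percentile distances); evaluation tools differ in details such as surface extraction and how the percentile is pooled, so this should not be read as the challenge's exact implementation.

```python
import numpy as np


def hd95(points_a: np.ndarray, points_b: np.ndarray) -> float:
    """95th-percentile symmetric Hausdorff distance between point sets of shape (n, dim)."""
    diff = points_a[:, None, :] - points_b[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    a_to_b = dist.min(axis=1)                 # each point of A to its nearest point of B
    b_to_a = dist.min(axis=0)                 # each point of B to its nearest point of A
    return float(max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95)))


a = np.array([[0.0, 0.0, 0.0]])
b = np.array([[3.0, 4.0, 0.0]])
distance = hd95(a, b)
```

The pairwise distance matrix grows with the product of the two set sizes, so practical implementations usually rely on distance transforms or spatial trees instead.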
Table 1. Performance of each model before and after postprocessing. The individual 
Dice and HD95 results are listed on the left, and the mean values are on the right. 
Postprocessing is specially designed to improve the Dice scores of enhancing tumors 
(ET). Although postprocessing can cause a slight sacrifice in the HD95 mean, the 
overall results after postprocessing outperform the original predictions. 


Model          | Dice WT | Dice TC | Dice ET | HD95 WT | HD95 TC | HD95 ET | Dice mean | HD95 mean 
nnU-Net        | 0.9322  | 0.9032  | 0.8405  | 4.57    | 4.66    | 14.69   | 0.8919    | 7.97 
nnU-Net (Post) | 0.9322  | 0.9032  | 0.8599  | 4.57    | 4.66    | 11.47   | 0.8984    | 6.90 
UNet++         | 0.9292  | 0.9097  | 0.8491  | 4.58    | 7.54    | 13.54   | 0.8960    | 8.55 
UNet++ (Post)  | 0.9292  | 0.9097  | 0.8593  | 4.58    | 7.54    | 13.28   | 0.8994    | 8.47 
CoTr           | 0.9322  | 0.9111  | 0.8501  | 4.41    | 5.97    | 13.11   | 0.8978    | 7.83 
CoTr (Post)    | 0.9322  | 0.9111  | 0.8643  | 4.41    | 5.97    | 14.24   | 0.9025    | 8.21 


Cross Entropy (CE). CE is another widely used loss function that aims to alle- 
viate the negative influences caused by an imbalanced dataset. Hence, other class 
re-balancing methods or weighted class training techniques can be neglected. 


Specificity and Sensitivity. Specificity and sensitivity measure how valid a 
test is. The sensitivity of ET measures whether the model is capable of correctly 
identifying the enhancing tumor. Specificity of ET measures the model’s abil- 
ity in correctly differentiating the surrounding brain tissues from the enhancing 
tumor. In other words, sensitivity demonstrates the true-positive rate and speci- 
ficity demonstrates the true-negative rate of the model in terms of identifying 
tumors. 
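
Both rates follow directly from the voxel-wise confusion counts, as the sketch below shows for a binary mask of one sub-region. The function name is hypothetical, and the sketch assumes that both positive and negative voxels are present so that neither denominator is zero.

```python
import numpy as np


def sensitivity_specificity(pred: np.ndarray, target: np.ndarray) -> tuple:
    """Return (true-positive rate, true-negative rate) for binary masks."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.sum(pred & target)     # tumor voxels correctly labelled as tumor
    fn = np.sum(~pred & target)    # tumor voxels that were missed
    tn = np.sum(~pred & ~target)   # background voxels correctly left unlabelled
    fp = np.sum(pred & ~target)    # background voxels wrongly labelled as tumor
    return float(tp / (tp + fn)), float(tn / (tn + fp))


pred = np.array([1, 1, 0, 0])
target = np.array([1, 0, 0, 0])
sens, spec = sensitivity_specificity(pred, target)
```

Because background voxels vastly outnumber tumor voxels in a brain volume, specificity is typically very close to 1 and is less discriminative than sensitivity.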


3 Results 


3.1 Batch Dice (BD) Implementation 


In the initial design, the BD modification was expected to improve the 
segmentation ability, but in practice it does the opposite. Our results suggest 
that BD tends to over-delineate normal brain tissue 
as enhancing tumor. Table 2 compares two nnU-Net models that share the same 
hyperparameters. As seen, the one trained without BD outperforms the other. 

Visual predictions are depicted in Fig. 1. Enhancing tumor areas are plotted 
in red, blue marks the tumor core, and the whole tumor areas are shown in 
green. Fig. 1a shows that BD is prone to over-delineating the enhancing 
tumor, while Fig. 1b shows that predictions without BD are closer to the 
ground truth. BD updates Dice more frequently, resulting in an overemphasis 
on tiny components at the expense of the whole picture. 
Table 2. Batch Dice (BD) implementation comparison: the whole tumor (WT), tumor 
core (TC), and enhancing tumor (ET) Dice and HD95 scores of nnU-Net are shown in the 
first row, and the second row shows the nnU-Net model with the BD 
modification. The batch Dice loss function degrades the overall performance of 
nnU-Net, especially the HD95 score. 


Model        | Dice WT | Dice TC | Dice ET | HD95 WT | HD95 TC | HD95 ET | Dice mean | HD95 mean 
nnU-Net      | 0.9283  | 0.9074  | 0.8466  | 4.45    | 7.91    | 11.04   | 0.8941    | 7.80 
nnU-Net + BD | 0.9237  | 0.9025  | 0.8472  | 6.31    | 9.43    | 15.06   | 0.8911    | 10.27 


Fig. 1. Comparison of nnU-Net performance before and after using 
Batch Dice (BD). Green indicates the whole tumor, red indicates the enhancing tumor, 
and blue indicates the tumor core. (a) Three tumor regions predicted by nnU-Net 
with the batch Dice loss function. (b) Three tumor regions predicted by nnU-Net 
without the batch Dice loss function. (c) Ground truth. As the region indicated by 
the arrow suggests, the model with BD predicts normal brain tissue as enhancing tumor, 
whereas the model without BD does not have this tendency. 


3.2 Individual Model Comparison 


All baseline models without hyperparameter tuning are summarized in Table 3; 
all of them achieved a mean Dice of approximately 89%. The HD95 mean varies 
more, ranging from 7.85 to 11.25. 


Table 3. Baseline models summary: all baseline models with zero hyperparameter 
tuning already achieve good performance, with a Dice mean of approximately 
89% and HD95 means between about 8 and 11. 


Model          | Dice WT | Dice TC | Dice ET | HD95 WT | HD95 TC | HD95 ET | Dice mean | HD95 mean 
nnU-Net        | 0.9322  | 0.9032  | 0.8405  | 4.57    | 4.66    | 14.69   | 0.8919    | 7.97 
CoTr           | 0.9321  | 0.9108  | 0.8498  | 4.42    | 5.99    | 13.16   | 0.8976    | 7.85 
UNet++         | 0.9292  | 0.9097  | 0.8491  | 4.58    | 7.54    | 13.54   | 0.8960    | 8.55 
HRNet          | 0.9226  | 0.8976  | 0.8404  | 6.75    | 11.58   | 15.44   | 0.8869    | 11.25 
Swin-Unet (2D) | 0.9214  | 0.8796  | 0.8370  | 6.23    | 9.10    | 14.47   | 0.8793    | 9.93 


Since the nnU-Net baseline already showed convincing performance in 
segmenting brain tumors, we fine-tuned it and compared other 
combinations of hyperparameters. First, the number of training epochs was increased 
from 320 to 500. Ideally, increasing the number of times that the model learns from 
the training set allows the model to minimize the error. In our experiments, 
the Dice and HD95 scores reached 0.9056 and 7.69, respectively, 
after 500 training epochs with the Adam optimizer and postprocessing. Second, 
unbalanced data could potentially mislead the model into producing severely 
biased results. Thus, we replaced the Dice loss with the Tversky loss function [19], 
a generalization of Dice designed to overcome this challenge. This change 
did not alter Dice much but improved the HD95 by 10%. In addition, brain 
tumor segmentation can be viewed as a pixel-wise classification task. P. Arbeláez 
et al. designed a region-based object detector that classifies every single 
pixel and aggregates the votes to produce the final segmentation result 
[20]. We implemented this method along with multiple DA approaches. However, 
none of them surpassed the existing models. 
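
The relationship between the Tversky index and Dice can be illustrated with a short sketch. The weights `alpha` and `beta` and the function names are illustrative assumptions; the source does not report which values were used.

```python
import numpy as np


def tversky_index(pred: np.ndarray, target: np.ndarray,
                  alpha: float = 0.5, beta: float = 0.5, eps: float = 1e-6) -> float:
    """Tversky index: alpha weights false positives, beta weights false negatives."""
    tp = (pred * target).sum()
    fp = (pred * (1 - target)).sum()
    fn = ((1 - pred) * target).sum()
    return float((tp + eps) / (tp + alpha * fp + beta * fn + eps))


def tversky_loss(pred: np.ndarray, target: np.ndarray,
                 alpha: float = 0.5, beta: float = 0.5) -> float:
    """Loss to minimize: one minus the Tversky index."""
    return 1.0 - tversky_index(pred, target, alpha, beta)


# One of three target voxels is found, with no false positives.
pred = np.array([1.0, 0.0, 0.0, 0.0])
target = np.array([1.0, 1.0, 1.0, 0.0])
dice_like = tversky_index(pred, target, alpha=0.5, beta=0.5)
fn_weighted = tversky_index(pred, target, alpha=0.3, beta=0.7)
```

With `alpha = beta = 0.5` the index coincides with Dice; raising `beta` penalizes missed tumor voxels more heavily, which is why the loss is often used for small, under-represented classes.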


3.3 Ensemble Model Comparison 


Ensembling is an effective way to make the most of a combination of multiple 
models. To compare averaging and voting, we applied both methods to the same set 
of models. Specifically, ensemble mean and ensemble vote models of nnU-Net, CoTr, 
UNet++, HRNet, and Swin-Unet were built for comparison. The ensemble mean model 
slightly outperformed the vote model, with a Dice mean of 0.9036 and an HD95 mean 
of 8.13, whereas ensemble vote reached 0.9019 and 8.15, respectively. Other 
comparisons are in line with these findings: the ensemble 
mean consistently beats the ensemble vote. For this reason, ensemble vote is not 
studied further. 


3.4 Overall Comparison 


Among all the models attempted, the five best-performing models on the 
validation set were selected for aggregation into the final ensemble mean model. 
The validation results of each individual model and of the ensemble model are 
presented in Table 4. The ensemble model produces the highest Dice 
mean, 0.8783, and the best HD95 scores for the whole tumor and the tumor core, 
3.65 and 7.65, respectively. 
Table 4. Model comparison on the validation dataset: the five best-performing 
individual models are selected to aggregate the ensemble mean model. According to the 
validation results, the overall ensemble model showed a higher Dice mean than every 
single model and an acceptable HD95 mean. 


Model           | Dice WT | Dice TC | Dice ET | HD95 WT | HD95 TC | HD95 ET | Dice mean | HD95 mean 
nnU-Net 500 (a) | 0.9230  | 0.8727  | 0.8349  | 3.83    | 7.73    | 24.18   | 0.8769    | 11.91 
HRNet 320 (b)   | 0.9234  | 0.8689  | 0.8284  | 3.72    | 7.85    | 20.80   | 0.8734    | 10.79 
nnU-Net 320 (c) | 0.9195  | 0.8752  | 0.8372  | 5.61    | 7.70    | 20.91   | 0.8773    | 11.41 
UNet++ 320 (d)  | 0.9210  | 0.8659  | 0.8217  | 4.13    | 8.15    | 27.65   | 0.8695    | 13.31 
CoTr 320 (e)    | 0.9207  | 0.8566  | 0.8349  | 4.43    | 10.06   | 19.68   | 0.8707    | 11.39 
Ensemble (f)    | 0.9258  | 0.8747  | 0.8344  | 3.65    | 7.65    | 24.14   | 0.8783    | 11.81 


(a) nnU-Net with AdamW optimizer and 500 training epochs; postprocessing with an optimal 
threshold of 500. 

(b) HRNet with half channel, AdamW optimizer, and 320 training epochs; postprocessing 
with an optimal threshold of 500. 

(c) nnU-Net with AdamW optimizer and 320 training epochs; postprocessing with an optimal 
threshold of 500. 

(d) Three-stage UNet++ with AdamW optimizer and 320 training epochs; postprocessing 
with an optimal threshold of 500. 

(e) CoTr with AdamW optimizer and 320 training epochs; postprocessing with an optimal 
threshold of 500. 

(f) Ensemble of the above five models, averaging outputs with equal weights; postprocessing 
with an optimal threshold of 750. 
Fig. 2. A graphical example. (a) An example of the input T2 modality; the high-signal 
gray areas are the abnormal regions. (b) The results predicted by our best model. 
Green indicates the whole tumor (WT), red the enhancing tumor (ET), and blue the 
tumor core (TC). (Color figure online) 
4 Discussion 


In the current research, we implemented numerous cutting-edge deep neural 
networks, as individual and ensemble models with various architectures, 
to improve the segmentation of brain tumors. We found that the predicted Dice and 
HD95 were improved by ensemble models and that the ensemble mean consistently 
outperformed the other aggregation methods. The final model was obtained from the top 
five models according to their 5-fold cross-validation and postprocessing results. 
The code is written in PyTorch, and all models were run on an AWS 
p2.xlarge instance (Tesla K80 with 12 GB of GPU RAM, 61 GB of RAM, 4 vCPUs). These 
models were trained with the AdamW optimizer but with different numbers of epochs: 
CoTr with 320 training epochs, three-stage UNet++ with 320 epochs, nnU-Net 
with 320 and 500 training epochs, and HRNet with half channel and 320 training 
epochs. Together they produce the best predictions on validation data, with whole 
tumor, tumor core, and enhancing tumor Dice scores of 0.9256, 0.8774, and 0.8576, 
and HD95 scores of 4.36, 14.80, and 14.49, respectively. 

By implementing deep neural networks with various architectures, like nnU- 
Net, UNet++, HRNet, Swin-Unet, and CoTr, our research has addressed differ- 
ent challenges in brain tumor segmentation. In addition, hyperparameters are 
further fine-tuned to obtain better performance, such as loss function, DA, post- 
processing, etc. The advantages of various models are strengthened by aggregat- 
ing the unweighted averages, which is in line with previously reported findings 
demonstrated by K. Kamnitsas et al. [21]. 

According to our research, a few techniques that were tried in an attempt 
to boost model performance gave unsatisfactory results. First of all, 
the example results show that BD does not improve the model's ability 
to depict tiny tumors, but instead depicts the background as part of the 
enhancing tumor. Since BD failed to capture small tumors, we chose to remove 
the entire enhancing tumor if it is smaller than a threshold. Indeed, 
postprocessing effectively improved the ET Dice, but recognizing tiny enhancing 
tumors is critical in clinical practice. Moreover, the specially designed 
region-based optimization method and DA did not improve model accuracy. Last but 
not least, more epochs do not guarantee better performance and may cause 
overfitting. 


Future Work. Firstly, although the ensemble approach achieves 
excellent performance, the individual models still have the potential to be 
developed further. According to our research, HRNet is capable of capturing 
abundant high-resolution information and may thus better handle complex 
segmentation problems, which motivates further related studies. 

Secondly, as reflected in Table 3, CoTr has the highest Dice mean and the lowest 
HD95 mean among the baseline models, which indicates that it is worth an in-depth 
study in future research. 

Finally, the ensemble mean does give reliable predictions, but the prediction 
accuracy of some models on the committee is slightly worse than that of others. In 
this case, reducing the weights of the corresponding models is expected to give 
better results. 
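
The weighted variant suggested here could look like the following sketch, in which weaker committee members receive smaller weights. The function name, the array layout `(n_models, n_classes, *spatial)`, and the example weights are all hypothetical; no weighted ensemble was evaluated in the study.

```python
import numpy as np


def weighted_ensemble(probs: np.ndarray, weights) -> np.ndarray:
    """Weighted average of class probabilities over models, then per-voxel argmax."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                           # normalize so the weights sum to 1
    averaged = np.tensordot(w, probs, axes=1)  # contracts the model axis
    return averaged.argmax(axis=0)


# Three models, two classes, one voxel; the first model is down-weighted.
probs = np.array([[[0.1], [0.9]],
                  [[0.6], [0.4]],
                  [[0.6], [0.4]]])
equal_label = weighted_ensemble(probs, [1.0, 1.0, 1.0])
reweighted_label = weighted_ensemble(probs, [0.1, 1.0, 1.0])
```

With equal weights this reduces to the ensemble mean; the weights themselves would need to be chosen on held-out data, for example from each model's cross-validation score.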
Abstract. Glioblastoma is an aggressive type of cancer that can develop in the 
brain or spinal cord. Magnetic Resonance Imaging (MRI) is key to diagnosing 
and tracking brain tumors in clinical settings. Brain tumor segmentation in MRI is 
required for disease diagnosis, surgical planning, and prognosis. As these tumors 
are heterogeneous in shape and appearance, their segmentation becomes a chal- 
lenging task. The performance of automated medical image segmentation has 
considerably improved because of recent advances in deep learning. Introducing 
context encoding with deep CNN models has shown promise for semantic segmen- 
tation of brain tumors. In this work, we use a 3D UNet-Context Encoding (UNCE) 
deep learning network for improved brain tumor segmentation. Further, we intro- 
duce epistemic and aleatoric Uncertainty Quantification (UQ) using Monte Carlo 
Dropout (MCDO) and Test Time Augmentation (TTA) with the UNCE deep learn- 
ing model to ascertain confidence in tumor segmentation performance. We build 
our model using the training MRI image sets of RSNA-ASNR-MICCAI Brain 
Tumor Segmentation (BraTS) Challenge 2021. We evaluate the model perfor- 
mance using the validation and test images from the BraTS challenge dataset. 
Online evaluation of validation data shows dice score coefficients (DSC) of 0.7787, 
0.8499, and 0.9159 for enhancing tumor (ET), tumor core (TC), and whole tumor 
(WT), respectively. The dice score coefficients of the test datasets are 0.6684 for 
ET, 0.7056 for TC, and 0.7551 for WT, respectively. 


Keywords: Glioblastoma · Segmentation · Deep neural network · Uncertainty · 
Monte Carlo dropout · Test time augmentation 


1 Introduction 


Glioblastoma (GBM) is the most common and aggressive malignant primary tumor 
of the central nervous system (CNS) in adults, with extreme intrinsic heterogeneity 
in appearance, shape, and histology [1]. Patients diagnosed with the most aggressive 
type of brain tumor have a median survival time of two years or less [2]. Accurate 
brain tumor segmentation is important not only for treatment planning but also for 
follow-up evaluation [3]. Manual brain tumor segmentation is time-consuming, less 
efficient, and prone to error [4, 5]. Therefore, it is desirable to have a robust, 
computer-aided, image-based automatic brain tumor segmentation system. Different 
strategies for brain tumor segmentation have been investigated, including 
threshold-based, region-based, and traditional machine learning-based methods [6-9]. 
However, those methods are limited by complex mathematical models or difficult 
hand-crafted feature extraction. Recent automated tumor segmentation systems based on 
deep learning (DL) models have demonstrated considerable performance improvements 
[3, 10]. DL has made it feasible to build large-scale trainable models that can learn 
the best features for a specific task [11]. To achieve successful tumor segmentation, 
DL models often require many training examples. Building an accurate deep learning 
model is very challenging because of the scarcity of biomedical and bioimaging 
datasets. Therefore, proper regularization and hyper-parameter tuning are required to 
develop an efficient DL network. 

Inspired by the popular deep learning architecture known as UNet [12] and the 
concept of context encoding designed for semantic segmentation [13], we implement a 
state-of-the-art deep UNet-Context Encoding (UNCE) framework for automatic brain 
tumor segmentation. Additionally, we compute the uncertainty using a combination of 
Monte Carlo dropout (MCDO) and test time augmentation (TTA) of data to improve the 
overall performance and to obtain a confidence measure in the brain tumor segmentation 
outputs [14]. 


2 Methods 


2.1 Data Descriptions 


Fig. 1. Examples of four different MRI modalities of two different training samples: (a) T1, (b) 
T1ce, (c) T2 and (d) T2-FLAIR. 


The RSNA-ASNR-MICCAI Brain Tumor Segmentation (BraTS) Challenge 2021 
dataset is obtained from multiple institutions under standard clinical conditions 
[1]. Different institutions used different equipment and imaging protocols, which 
resulted in vastly heterogeneous image quality reflecting diverse clinical practices 
across institutions. Ground truth annotations of every tumor sub-region for brain tumor 
segmentation were approved by expert neuroradiologists [15-18]. The BraTS 2021 training 
dataset consists of 1251 cases with ground truth labels. In the validation phase, 
219 cases are provided without any associated ground truth. Each patient case 
has four MRI modalities: T1-weighted (T1), T1-weighted contrast enhancement (T1ce), 
T2-weighted (T2), and T2-weighted fluid-attenuated inversion recovery (T2-FLAIR). 
All modality sequences are co-registered, skull-stripped, denoised, and bias-corrected 
[19]. The image size is 240 × 240 × 155 for each imaging modality. Tumors have different 
sub-tissues: necrotic (NC), peritumoral edema (ED), and enhancing tumor (ET). Figure 1 
shows the images of four MRI modalities (T1, T1ce, T2 and FLAIR) of two different 
training examples. 


2.2 Network Architecture 


Fig. 2. Overview of the UNet-Context Encoding (UNCE) Network. 


The UNet is a convolutional network architecture used for fast and precise segmentation 
of images. The bottleneck layer of UNet captures the global semantic context 
features of the scene with rich contextual information. In this work, we implement 
a UNCE network architecture that integrates multiple volumetric MRI processing 
tasks. Inspired by the context encoding network [13], the UNCE architecture 
is substantially augmented for brain tumor segmentation. 

An overview of the UNCE deep learning method for tumor segmentation is shown 
in Fig. 2. The UNCE captures global texture features using a semantic loss to provide 
regularization in training. The architecture consists of encoding, context encoding, and 
decoding modules. The encoding module extracts high-dimensional features of the input. 
The context encoding module produces updated features and a semantic loss to regularize 
the model by ensuring all segmentation classes are represented. The decoding module 
reconstructs the feature maps to produce segmentation masks as output. Figure 2 presents 
a detailed architecture of the proposed UNCE model including the parameter settings 
used at each layer. 


2.3 Implementation 


The MRI scans provided for the competition are co-registered, skull-stripped, denoised 
and bias-corrected. Since the dataset was collected from different institutions and MRI 
scanners, the intensities show substantial variations across examples. Consequently, we 
normalize every example to have zero mean and unit standard deviation. 
The dimension of each training sample is 240 x 240 x 155. Each volume is cropped to 
192 x 160 x 128 to limit memory use and computational cost. To generate additional 
training images, we augment the data by adding uniform white noise with limited 
amplitude. 
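
The preprocessing steps above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function names, the choice of a centered crop, the order of normalization and cropping, and the noise amplitude of 0.1 are all assumptions, since the text does not specify them.

```python
import numpy as np


def zscore_normalize(volume):
    """Scale a volume to zero mean and unit standard deviation."""
    std = volume.std()
    return (volume - volume.mean()) / (std if std > 0 else 1.0)


def center_crop(volume, target_shape=(192, 160, 128)):
    """Crop a 3D volume around its center (crop position is an assumption)."""
    starts = [(s - t) // 2 for s, t in zip(volume.shape, target_shape)]
    slices = tuple(slice(st, st + t) for st, t in zip(starts, target_shape))
    return volume[slices]


def add_uniform_noise(volume, amplitude=0.1, rng=None):
    """Augment a volume with uniform noise in [-amplitude, amplitude)."""
    rng = np.random.default_rng() if rng is None else rng
    return volume + rng.uniform(-amplitude, amplitude, size=volume.shape)


# Synthetic stand-in for one 240 x 240 x 155 modality volume.
rng = np.random.default_rng(0)
scan = rng.normal(100.0, 20.0, size=(240, 240, 155)).astype(np.float32)

prepared = center_crop(zscore_normalize(scan))
augmented = add_uniform_noise(prepared, amplitude=0.1, rng=rng)
```

Keeping the noise amplitude small relative to the unit standard deviation of the normalized volume is one way to read "limited amplitude".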

A critical feature of the proposed UNCE is the context encoding module, which 
computes scaling factors related to the representation of all classes. These factors are 
learned simultaneously in the training phase via the semantic error loss regularization, 
denoted by $L_{se}$. The scaling factors capture global information of all classes, 
essentially learning to mitigate the training bias that may arise due to imbalanced class 
representation in the data. To calculate the Semantic Error loss (SE-loss), we construct 
another fully connected layer with a sigmoid activation function on top of the encoding 
layer to predict which object classes are present in the image [13]. Accordingly, the 
final loss function consists of two terms: 


$$L = L_{dice} + L_{se} \qquad (1)$$

where $L_{dice}$ is the Dice loss computed from the difference between the prediction and 
the ground truth, and $L_{se}$ is the semantic error loss. 
The Dice loss is computed as: 

$$L_{dice} = 1 - DSC \qquad (2)$$

where DSC is the Dice similarity coefficient [20]. The DSC is defined as 

$$DSC = \frac{2TP}{FP + 2TP + FN} \qquad (3)$$

where TP, FP and FN are the numbers of true positives, false positives and false negatives, 
respectively. 
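
As a worked illustration of Eqs. (2) and (3), the sketch below counts TP, FP and FN on binary masks. It is a hard-count version for clarity only; training relies on a soft Dice, and the function names and the convention for two empty masks are assumptions.

```python
import numpy as np


def dice_coefficient(pred, truth):
    """DSC = 2TP / (FP + 2TP + FN) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    denom = fp + 2 * tp + fn
    # Assumption: two empty masks are treated as perfect agreement.
    return 1.0 if denom == 0 else 2.0 * tp / denom


def dice_loss(pred, truth):
    """Dice loss as in Eq. (2): 1 - DSC."""
    return 1.0 - dice_coefficient(pred, truth)


pred = np.array([1, 1, 0, 0, 1])
truth = np.array([1, 0, 0, 1, 1])
# TP = 2, FP = 1, FN = 1, so DSC = 4 / 6 and the loss is 1 / 3.
print(dice_coefficient(pred, truth), dice_loss(pred, truth))
```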

We use the Adam optimizer [20] with an initial learning rate of $lr_0 = 0.0001$ in the 
training phase, and the learning rate ($lr_i$) is gradually reduced according to: 

$$lr_i = lr_0 \times \left(1 - \frac{i}{N}\right)^{0.9} \qquad (4)$$

where i is the epoch counter and N is the total number of epochs in training. 
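
The decay rule in Eq. (4) can be evaluated directly. The short sketch below uses a hypothetical helper name and takes N = 300 purely as an example value.

```python
def poly_learning_rate(epoch, total_epochs, initial_lr=1e-4, power=0.9):
    """Polynomial decay: lr_i = lr_0 * (1 - i / N) ** power."""
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    return initial_lr * (1.0 - epoch / total_epochs) ** power


# Learning rate at the start of each of N = 300 epochs (example value for N).
schedule = [poly_learning_rate(i, 300) for i in range(300)]
```

The rate starts at the initial value and decreases monotonically, reaching zero only at i = N.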
Fig. 3. Training loss and training dice vs number of epochs. 

Fig. 4. T2 image overlaid with our prediction. From left to right: axial, coronal, sagittal view 
respectively. 


2.4 Training 


We implement the context-aware deep learning network in PyTorch and train the network 
on an NVIDIA V100 HPC platform using the BraTS 2021 training dataset (1251 cases). 
To train the network, we use 80% of the training data, and the remaining 20% is used 
to validate the trained model. We train the network using the regularization technique 
known as dropout. In this implementation, a randomly selected 30% of the weights are 
dropped during training. The same network is also trained without dropout. The UNCE 
architecture is trained for over 300 epochs, and the best performing versions based on 
the validation set are retained for testing. Figure 3 shows the training loss and 
training soft dice curves of the UNCE for the BraTS 2021 dataset. Effective training 
of the network is observed in the monotonically decreasing loss and the corresponding 
increase in training dice score. The network achieves an overall dice score of 0.90 
within the first 110 epochs and continues to improve at a slower pace. At the end of 
300 epochs, we get an average training loss of 0.0701 and the training soft dice 
reaches 0.9131. Once the network is fully trained, its performance is evaluated on the 
BraTS 2021 validation dataset (219 cases) through the online submission process made 
available by the challenge organizers. Figure 4 presents an example of a segmented 
tumor and the corresponding ground truth on T2 images. The segmentation results of 
the trained model are very close to the ground truth. 

Fig. 5. Monte Carlo dropout computation framework 


2.5 Uncertainty Measures 


Bayesian probability theory offers mathematically grounded tools to reason about model 
uncertainty. However, Bayesian techniques with deep learning usually result in models 
with prohibitive computational costs. Therefore, it may be prudent to utilize methods 
that are able to approximate the predictive posterior distribution without changing either 
the models or the optimization [22]. For instance, Monte Carlo Dropout (MCDO) utilizes 
the dropout layers within a deep neural network at test time to conduct Monte 
Carlo sampling of the parametric posterior to obtain an approximation of the predictive 
posterior distribution. We adopt this technique for the UNCE as follows: 1) activate the 
dropout layers in UNCE at test time; 2) replicate each testing image y N times and 
pass each replica through the dropout-enabled UNCE, which essentially runs each replica 
through a different version of UNCE generated by random dropout masks; 3) obtain the 
N different predictions $P_n(y)$, where each prediction is a vector of softmax scores for 
the C classes; 4) compute the average prediction score [23] for the N samples 
as follows: 

$$P_N(y) = \frac{1}{N} \sum_{n=1}^{N} P_n(y) \qquad (5)$$

In the MCDO technique, dropout remains turned on during evaluation. Additionally, 
we conduct test time augmentation of the inputs to further increase the sample size 
and to capture epistemic uncertainty in the predictions. The final segmentation results 
are obtained by averaging over all the outputs obtained after applying a combination of 
MCDO and TTA to a given testing example, as shown in Eq. (5). Figure 5 shows the 
framework of the MCDO method. The outputs from the Monte Carlo sampling models 
are fused to get the final segmented results. 
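
The averaging in Eq. (5), combined with test time augmentation, can be sketched as follows. This is a NumPy illustration under stated assumptions rather than the authors' PyTorch implementation: `toy_model` is a hypothetical stand-in for a dropout-enabled network, axis flips are assumed as the augmentations, and the number of samples per augmentation is arbitrary.

```python
import numpy as np


def softmax(logits, axis=0):
    """Numerically stable softmax along the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def mc_tta_average(stochastic_model, image, n_samples=100, flip_axes=(None, 1, 2, 3)):
    """Average softmax outputs over dropout samples and flipped inputs.

    image has shape (modalities, D, H, W); stochastic_model returns class
    logits of shape (classes, D, H, W) and differs randomly on every call.
    """
    total, count = None, 0
    for axis in flip_axes:
        augmented = image if axis is None else np.flip(image, axis=axis)
        for _ in range(n_samples):
            probs = softmax(stochastic_model(augmented), axis=0)
            if axis is not None:
                probs = np.flip(probs, axis=axis)  # map back to the original frame
            total = probs if total is None else total + probs
            count += 1
    mean_probs = total / count
    return mean_probs, mean_probs.argmax(axis=0)


# Hypothetical stochastic model: a per-voxel linear classifier with dropout.
rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 4))  # 3 classes, 4 modalities


def toy_model(x, drop_rate=0.3):
    mask = rng.random(weights.shape) >= drop_rate  # fresh dropout mask per call
    dropped = weights * mask / (1.0 - drop_rate)
    return np.tensordot(dropped, x, axes=([1], [0]))


image = rng.normal(size=(4, 8, 8, 8))
mean_probs, labels = mc_tta_average(toy_model, image, n_samples=10)
```

The spread of the individual samples around `mean_probs` (for example, their per-voxel variance) could serve as the confidence measure; that step is omitted here.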


3 Results and Discussion 


Table 1. Performance on the BraTS 2021 validation data for three models. 


UNCE model  | Statistical | Dice score               | Hausdorff95 
            | parameter   | ET     | TC     | WT     | ET    | TC    | WT 
No Dropout  | Mean        | 0.7696 | 0.8281 | 0.9121 | 20.24 | 12.10 | 4.71 
            | std         | 0.2489 | 0.2395 | 0.0762 | 77.34 | 50.10 | 7.64 
            | Median      | 0.8532 | 0.9278 | 0.9348 | 2.24  | 2.24  | 2.83 
            | 25quantile  | 0.7713 | 0.8324 | 0.8894 | 1.41  | 1.41  | 1.73 
            | 75quantile  | 0.9016 | 0.9544 | 0.9547 | 3.16  | 5.15  | 4.64 
Dropout     | Mean        | 0.7720 | 0.8465 | 0.9141 | 20.51 | 7.03  | 4.07 
            | std         | 0.2555 | 0.2011 | 0.0746 | 77.36 | 26.43 | 5.65 
            | Median      | 0.8589 | 0.9291 | 0.9350 | 2.24  | 2.00  | 2.45 
            | 25quantile  | 0.7870 | 0.8376 | 0.8947 | 1.41  | 1.41  | 1.41 
            | 75quantile  | 0.9052 | 0.9570 | 0.9575 | 3.19  | 5.00  | 4.24 
MCDO + TTA  | Mean        | 0.7786 | 0.8499 | 0.9159 | 20.26 | 6.71  | 4.35 
            | std         | 0.2501 | 0.2031 | 0.0752 | 77.35 | 25.97 | 6.98 
            | Median      | 0.8606 | 0.9310 | 0.9366 | 2.00  | 2.00  | 2.45 
            | 25quantile  | 0.7837 | 0.8321 | 0.8980 | 1.41  | 1.41  | 1.41 
            | 75quantile  | 0.9084 | 0.9579 | 0.9588 | 3.00  | 5.01  | 4.24 


We first obtain the best performing UNCE models using the training dataset provided 
by the RSNA-ASNR-MICCAI Brain Tumor Segmentation (BraTS) Challenge 
2021 organizers. We use three models (a model without dropout, a model with dropout, 
and a model using combined MCDO and TTA) to generate the segmented results. These 
models are then evaluated using the validation dataset through the online evaluation tool. 
Table 1 shows the DSC and Hausdorff distance of the three models on the 219 validation 
examples. The DSC quantifies the similarity between tumor segmentation and ground 
truth, and the Hausdorff distance measures how far two subsets (tumor 
segmentation and ground truth) of a metric space are from each other. The dice scores 
of the model without dropout are 0.7698, 0.8281 and 0.9121 for enhancing tumor (ET), 
tumor core (TC) and whole tumor (WT), respectively. The Hausdorff95 distances of the 
same model are 20.24 for ET, 12.10 for TC and 4.71 for WT. The model that is trained 
by applying dropout in each layer of the network shows improvements in the dice scores 
for all three tumor regions and in the Hausdorff95 distances for TC and WT. The third 
model uses MCDO and TTA to further improve performance through the uncertainty 
quantification strategy discussed in Sect. 2.5. For this experiment, we set the Monte 
Carlo sample number N = 100 and average the corresponding sample outputs to form 
the final segmentation results. The evaluation results show a further improvement of 
dice scores compared to the other two models. 

The final model was tested by the BraTS challenge organizers. Following the requirement 
of the challenge, we prepared a Docker image of our model. In the test phase, 570 
MRI cases were segmented to evaluate the model performance. Note that among the 570 
test examples, 87 were not segmented successfully. Table 2 shows the test results for our 
model. 


Table 2. Performance on the BraTS 2021 test data. 


UNCE model  | Statistical | Dice score                | Hausdorff95 
            | parameter   | ET     | TC      | WT     | ET     | TC     | WT 
MCDO + TTA  | Mean        | 0.6684 | 0.70562 | 0.7551 | 74.26  | 74.36  | 62.77 
            | std         | 0.3527 | 0.38792 | 0.3544 | 145.76 | 144.47 | 132.87 
            | Median      | 0.8424 | 0.9239  | 0.9262 | 2.23   | 12.23  | 3 
            | 25quantile  | 0.6582 | 0.5461  | 0.8148 | 1.41   | 1      | 1.41 
            | 75quantile  | 0.8984 | 0.9601  | 0.9588 | 6.38   | 14.08  | 11.08 


4 Conclusion 


In this work, we use a UNet-context encoding 3D deep learning based method for brain 
tumor segmentation. The model takes advantage of context encoding, which captures 
global context features. A semantic error loss is used to regularize the model, which 
helps to manage the class specific sample bias that exists within the tumor tissue 
segmentation task. The BraTS 2021 training dataset is used to generate and train several 
versions of the proposed UNet-Context Encoding (UNCE) model, and three representative 
versions are evaluated using the validation dataset provided by the BraTS 2021 
organizers. The results show that the UNCE version combined with the Monte Carlo 
Dropout (MCDO) and Test Time Augmentation (TTA) based uncertainty quantification 
framework yields the 
best performance. 


Abstract. Glioblastomas are the most aggressive, fast-growing primary brain 
cancers, which originate in the glial cells of the brain. Accurate identification of the 
malignant brain tumor and its sub-regions is still one of the most challenging 
problems in medical image segmentation. The Brain Tumor Segmentation Challenge 
(BraTS) has been a popular benchmark for automatic glioblastoma 
segmentation algorithms since its initiation. This year, the BraTS 2021 challenge 
provides the largest multi-parametric MRI (mpMRI) dataset of 2,000 pre-operative 
patients. In this paper, we propose a new aggregation of two deep learning 
frameworks, namely DeepSeg and nnU-Net, for automatic glioblastoma recognition in 
pre-operative mpMRI. Our ensemble method obtains Dice similarity scores of 
92.00, 87.33, and 84.10 and Hausdorff Distances of 3.81, 8.91, and 16.02 for the 
whole tumor, tumor core, and enhancing tumor regions, respectively, on the BraTS 
2021 validation set, ranking us among the top ten teams. These experimental 
findings provide evidence that the method can be readily applied clinically, thereby 
aiding in brain cancer prognosis, therapy planning, and therapy response monitoring. 
A docker image for reproducing our segmentation results is available online at 
(https://hub.docker.com/r/razeineldin/deepseg21). 


Keywords: Brain - BraTS - CNN - Glioblastoma - MRI - Segmentation 


1 Introduction 


Glioblastomas (GBM), the most common and aggressive malignant primary tumors 
of the brain in adults, occur with highly heterogeneous sub-regions including the 
enhancing tumor (ET), peritumoral edematous/invaded tissue (ED), and the necrotic 
components of the core tumor (NCR) [1, 2]. Accurate localization of GBM and its 
sub-regions in magnetic resonance imaging (MRI) is still considered one of the most 
challenging segmentation problems in the medical field. Manual segmentation is the 
gold standard for neurosurgical planning, interventional image-guided surgery, 
follow-up procedures, and monitoring the tumor growth. However, identification of 
the GBM tumor and its sub-regions by hand is time-consuming, subjective, and highly 
dependent on the experience of clinicians. 

The Medical Image Computing and Computer-Assisted Interventions Brain Tumor 
Segmentation Challenge (MICCAI BraTS) [3, 4] has been focusing on addressing this 
problem of finding the best automated tumor sub-region segmentation algorithm. The 
Radiological Society of North America (RSNA), the American Society of Neuroradiology 
(ASNR), and MICCAI jointly organize this year's BraTS challenge [2], celebrating 
its 10th anniversary. BraTS 2021 provides the largest annotated and publicly available 
multi-parametric MRI (mpMRI) dataset [2, 5, 6] as a common benchmark for the 
development and training of automatic brain tumor segmentation methods. 

Deep learning-based segmentation methods have gained popularity in the medical 
arena, outperforming other traditional methods in brain cancer analysis [7-10], more 
specifically the convolutional neural network (CNN) [11] and the encoder-decoder 
architecture with skip connections, first introduced by the U-Net [12, 13]. In the 
context of the BraTS challenge, the recent winning contributions of 2018 [14], 2019 [15], 
and 2020 [16] extend the encoder-decoder pattern by adding a variational autoencoder 
(VAE) [14], using a two-stage cascaded U-Net [15], or using the baseline U-Net 
architecture with significant architecture changes [16]. 

In this paper, we propose a fully automated CNN method for GBM segmentation 
based on an ensemble of two encoder-decoder methods, namely, DeepSeg [10], our 
recent deep learning framework for automatic brain tumor segmentation using 
two-dimensional T2 Fluid Attenuated Inversion Recovery (T2-FLAIR) scans, and nnU-Net 
[16], a self-configuring method for automatic biomedical segmentation. The remainder 
of the paper is organized as follows: Sect. 2 describes the BraTS 2021 dataset and the 
architecture of our ensemble method. Experimental results are presented in Sect. 3. This 
research work is concluded in Sect. 4. 


2 Materials and Methods 


2.1 Data 


The BraTS 2021 training database [2] includes 1251 mpMRI images acquired from 
multiple institutions using different MRI scanners and protocols. For each patient, there 
are four mpMRI volumes: pre-contrast T1-weighted (T1), post-contrast T1-weighted 
(T1Gd), T2-weighted (T2), and T2-FLAIR, as shown in Fig. 1. Ground truth labels 
are provided for the training dataset only, indicating background (label 0), necrotic and 
non-enhancing tumor core (NCR/NET) (label 1), peritumoral edema (ED) (label 2), 
and enhancing tumor (ET) (label 4). These labels are combined to generate the three 
regions used in the final evaluation: the tumor core (TC) of labels 1 and 4, the enhancing 
tumor (ET) of label 4, and the whole tumor (WT) of all labels. In addition, BraTS 2021 
includes 219 validation cases without any associated ground truth labels. 


Fig. 1. A sample of the mpMRI BraTS 2021 training set. Shown are image slices in three different 
MRI modalities, T2 (a), T1Gd (b), and T2-FLAIR (c), and the ground truth segmentation (d). The 
color labels indicate edema (blue), enhancing solid tumor (green), and non-enhancing tumor core 
and necrotic core (magenta). Images were obtained by using the 3D Slicer software [17]. 
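
The mapping from annotation labels to the three evaluated regions described in Sect. 2.1 can be written compactly. The sketch below is illustrative only; the function name and the dictionary layout are assumptions.

```python
import numpy as np


def labels_to_regions(label_map):
    """Convert a BraTS label map (0, 1, 2, 4) into the three evaluated regions."""
    label_map = np.asarray(label_map)
    return {
        "ET": label_map == 4,                    # enhancing tumor
        "TC": np.isin(label_map, (1, 4)),        # tumor core: NCR/NET + ET
        "WT": np.isin(label_map, (1, 2, 4)),     # whole tumor: all tumor labels
    }


example = np.array([0, 1, 2, 4, 2])
regions = labels_to_regions(example)
```

The regions are nested (ET within TC within WT), which is why region-based training can treat them as three overlapping binary targets.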


2.2 Data Pre-processing 


The BraTS 2021 data were acquired using different clinical protocols, from different MRI 
scanners and multiple institutions; therefore, a pre-processing stage is essential. First, 
standard pre-processing routines have been applied by the BraTS challenge as stated in 
[2]. These include conversion from DICOM into NIFTI file format, re-orientation to the 
same coordinate system, co-registration of the multiple MRI modalities, resampling to 
1 x 1 x 1 mm isotropic resolution, and brain extraction and skull-stripping. 

Following these pre-processing steps, we applied an image cropping stage in which each 
image was cropped to the brain voxels, and the resultant image was resized to 
192 x 224 x 160 voxels. This effectively results in a closer field of view (FOV) to the 
brain with fewer image voxels, leading to smaller resource consumption while training 
our deep learning models. Finally, z-score normalization was applied by subtracting 
the mean value and dividing by the standard deviation individually for each input MRI 
image. 
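
A minimal sketch of the brain cropping step is given below. It assumes that background voxels are exactly zero after skull-stripping and crops to the bounding box of non-zero voxels; the resizing to 192 x 224 x 160 is left out because the text does not say whether padding or interpolation is used. The function name is hypothetical.

```python
import numpy as np


def crop_to_brain(volume):
    """Crop a skull-stripped volume to the bounding box of its non-zero voxels."""
    nonzero = np.argwhere(volume != 0)
    if nonzero.size == 0:
        return volume  # nothing to crop in an empty volume
    lower = nonzero.min(axis=0)
    upper = nonzero.max(axis=0) + 1
    return volume[tuple(slice(lo, hi) for lo, hi in zip(lower, upper))]


# Toy volume with a small non-zero "brain" block inside an empty background.
rng = np.random.default_rng(0)
toy = np.zeros((10, 10, 10))
toy[2:5, 3:8, 1:4] = rng.uniform(1.0, 2.0, size=(3, 5, 3))

cropped = crop_to_brain(toy)
normalized = (cropped - cropped.mean()) / cropped.std()  # per-image z-score
```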


2.3 Neural Network Architectures 


We used two different CNN models, namely, DeepSeg [10] and nnU-Net [9], which follow 
the U-Net pattern [12, 13] and consist of an encoder-decoder architecture interconnected 
by skip connections. The final results were obtained by using Simultaneous Truth and 
Performance Level Estimation (STAPLE) [18], which is based on the 
expectation-maximization algorithm. 


DeepSeg. Figure 2 shows a 3D enhanced version of our first model, DeepSeg, which 
is a modular framework for fully automatic brain tumor detection and segmentation. 
The proposed network differs from the original network in the following ways. First, the 
original DeepSeg network was proposed for 2D tumor segmentation using only FLAIR 
MRI data, whereas here we apply 3D convolutions over all slices for more robust and 
accurate results. Second, we incorporate all the available MRI modalities (T1, T1Gd, 
T2, and T2-FLAIR) so that the GBM sub-regions can be detected, in contrast to 
the whole tumor only in the original DeepSeg paper [10]. Third, we incorporate 
additional modifications such as region-based training, extensive data augmentation, a 
simple postprocessing technique, and a combination of cross-entropy (CE) and Dice 
similarity coefficient (DSC) loss functions. 


Following the structure of U-Net, DeepSeg consists of two main parts: a feature 
extractor part and an image upscaling part. Downsampling is performed with 2 x 2 x 
2 max-pooling and upsampling is performed with 2 x 2 x 2 up convolution. DeepSeg 
uses recent advances in CNNs including dropout, batch normalization 
(BN), and rectified linear unit (ReLU) [19, 20]. The feature extractor consists of five 
consecutive convolutional blocks, each containing two 3 x 3 x 3 convolutional layers, 
followed by ReLU. In the image upscaling part, the resultant feature map of the feature 
extractor is upsampled using deconvolutional layers. The final output segmentation is 
attained using a 1 x 1 x 1 convolutional layer with a softmax output. 


nnU-Net. The baseline nnU-Net, outlined in Fig. 3, is a self-adaptive 
deep learning-based framework for 3D semantic biomedical segmentation [9]. Unlike 
DeepSeg, nnU-Net does not use any of the recently proposed architectural advances in 
deep learning and only depends on plain convolutions for feature extraction. nnU-Net 
uses strided convolutions for downsampling and transposed convolutions for upsampling. 
The initial number of convolutional filters is set to 32 and doubled at each following 
layer, up to a maximum of 320 in the bottleneck layers. 
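
The doubling rule can be made concrete with a few lines of arithmetic. The helper below is hypothetical; it assumes that each stage halves the spatial size of a 128 x 128 x 128 patch down to a 4 x 4 x 4 bottleneck, which is consistent with the 32x128x128x128 and 320x4x4x4 feature map sizes labeled in Fig. 3.

```python
def feature_map_plan(patch_size=128, base_filters=32, max_filters=320, bottleneck_size=4):
    """List (filters, spatial size) per encoder stage under a double-and-cap rule."""
    plan = []
    filters, size = base_filters, patch_size
    while size >= bottleneck_size:
        plan.append((filters, size))
        filters = min(filters * 2, max_filters)  # double, capped at the maximum
        size //= 2                               # stride-2 downsampling
    return plan


plan = feature_map_plan()
for filters, size in plan:
    print(f"{filters} x {size} x {size} x {size}")
```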


By modifying the baseline nnU-Net and adding BraTS-specific processing, nnU-Net 
won first place in the segmentation task of the BraTS challenge in 2020 [16]. The softmax 
output was replaced with a sigmoid layer to target the three evaluated tumor sub-regions: 
whole tumor (consisting of all 3 labels), tumor core (label 1 and label 4), and enhancing 
tumor (label 4). Further, the training loss was changed from categorical cross-entropy to 
binary cross-entropy, which optimizes each of the sub-regions independently. Also, the 
batch size was increased to 5 as opposed to 2 in the baseline nnU-Net, and more aggressive 
data augmentations were incorporated. Similar to DeepSeg, nnU-Net utilized BN instead 
of instance normalization. Finally, the sample dice loss function was changed to batch 
dice by computing over all samples in the batch. In our experiments, we incorporated 
the top-performing nnU-Net configuration on the validation set of BraTS 2020. 


Fig. 2. DeepSeg network consists of convolution neural blocks (blue boxes), downsampling using 
maximum pooling (orange arrows), and upsampling using up convolution (blue arrows), and 
softmax output layer (green block). The input patch size was set to 128 x 128 x 128. (Color 
figure online) 


Fig. 3. nnU-Net network consists of strided convolution blocks (grey boxes), and upsampling as 
convolution transposed (blue arrows). The input patch size was set to 128 x 128 x 128 and the 
maximum number of filters is 320 [16]. Figure legend: 3x3x3 conv - IN - lReLU (stride n,n,n; 
default = 1,1,1); 1x1x1 conv - softmax; skip connection; convolution transposed; feature map 
sizes range from 32x128x128x128 to 320x4x4x4. (Color figure online) 


2.4 Post-processing 


Determining the small blood vessels in the tumor core (necrosis or edema) is one of 
the most challenging segmentation tasks in the BraTS Challenge. This is particularly 
evident in low-grade glioma (LGG) patients, who may not have an enhancing tumor. 
For such cases, the BraTS challenge evaluates the segmentation as a binary value of 0 or 
1, so even a few false positive voxels in the predicted segmentation map 
of a patient with no enhancing tumor result in a dice value of 0. To overcome this 
problem, all enhancing tumor outputs were relabeled as necrotic (label 1) if the total 
predicted ET region was smaller than a threshold. This threshold value was selected based 
on our analysis of the validation set results so that our model performs better. This 
strategy has the possible side effect of removing some correct predictions. 
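
The relabeling rule can be sketched in a few lines. The function name is hypothetical, and the default threshold of 200 voxels is the value reported in Sect. 3.1; everything else follows the label convention of Sect. 2.1.

```python
import numpy as np


def suppress_small_enhancing_tumor(segmentation, threshold_voxels=200,
                                   et_label=4, ncr_label=1):
    """Relabel all ET voxels as necrotic when the predicted ET volume is small."""
    segmentation = np.asarray(segmentation)
    et_mask = segmentation == et_label
    if 0 < et_mask.sum() < threshold_voxels:
        cleaned = segmentation.copy()
        cleaned[et_mask] = ncr_label
        return cleaned
    return segmentation


small = np.zeros((20, 20, 20), dtype=np.uint8)
small[0:5, 0:5, 0:5] = 4          # 125 ET voxels: below the threshold
large = np.zeros((20, 20, 20), dtype=np.uint8)
large[0:6, 0:6, 0:6] = 4          # 216 ET voxels: kept as enhancing tumor

small_cleaned = suppress_small_enhancing_tumor(small)
large_cleaned = suppress_small_enhancing_tumor(large)
```

Because the decision is made on the total ET volume, a small but genuine enhancing region is removed as well, which is the side effect mentioned above.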


3 Experiments and Results 


3.1 Cross-validation Training 


We train each model with five-fold cross-validation on the 1251 training cases of BraTS 
2021 for a maximum of 1000 epochs. The Adam optimizer [21] has been applied with an 
initial learning rate of 1e-4 and a default value of 1e-7 for epsilon. Each configuration 
was trained on a single Nvidia GPU (RTX 2080 Ti or RTX 3060). The input to our 
networks is randomly sampled patches of 128 x 128 x 128 voxels with batch sizes 
varying from 2 to 5, and the post-processing threshold is set to 200 voxels. This tiling 
strategy allows the model to be trained on multi-modal high-resolution MRI images with 
low GPU memory requirements. The DeepSeg model was implemented using TensorFlow 
[22], while nnU-Net was implemented using PyTorch [23]. 

For training DeepSeg, the loss function is a combination of CE and DSC loss 
functions, which can be calculated as follows: 

$$L_{DeepSeg} = DSC + CE = \frac{2 \sum y p + \epsilon}{\sum y + \sum p + \epsilon} - \sum y \log(p) \qquad (1)$$

where p denotes the network softmax predictions and $y \in \{0, 1\}$ represents the ground 
truth binary value for each class. Note that $\epsilon$ is the smoothing parameter that makes 
the dice function differentiable. 

To overcome the effect of class imbalance between tumor labels and the healthy brain 
tissue, we apply on-the-fly spatial data augmentations during training (random rotation 
between 0 and 30°, random 3D flipping, power-law gamma intensity transformation, or 
a combination of them). 
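
A combined Dice and cross-entropy loss of this kind can be sketched in NumPy as follows. The sketch is not the DeepSeg implementation: it uses 1 - DSC for the Dice term so that lower values are better, and it averages the cross-entropy over voxels instead of summing it. Both choices, as well as the function name and the value of eps, are assumptions.

```python
import numpy as np


def dice_ce_loss(probs, onehot, eps=1e-5):
    """Combined loss for softmax outputs and one-hot labels of shape (classes, voxels)."""
    intersection = (onehot * probs).sum()
    dsc = (2.0 * intersection + eps) / (onehot.sum() + probs.sum() + eps)
    # Cross-entropy: -sum over classes of y * log(p), averaged over voxels.
    ce = -(onehot * np.log(np.clip(probs, 1e-12, 1.0))).sum(axis=0).mean()
    return (1.0 - dsc) + ce  # assumption: Dice term expressed as 1 - DSC


onehot = np.array([[1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])
perfect_loss = dice_ce_loss(onehot.copy(), onehot)
uniform_loss = dice_ce_loss(np.full_like(onehot, 0.5), onehot)
```

A perfect prediction gives a loss of zero under this convention, while an uninformative prediction of 0.5 per class gives roughly 0.5 from the Dice term plus log(2) from the cross-entropy.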


3.2 Online Validation Dataset 


The results of our models on the BraTS 2021 validation set are summarized in Table 1, 
where the five models of each cross-validation training configuration are averaged as an 
ensemble. Two evaluation metrics are used for the BraTS 2021 benchmark, both computed 
by the online evaluation platform of Sage Bionetworks Synapse (Synapse): 
the DSC and the 95% Hausdorff distance (HD95). We compute the averages of the DSC 
scores and HD95 values across the three evaluated tumor sub-regions and then use them 
to rank our methods in the final column. 

DeepSeg A refers to the baseline DeepSeg model, which uses large input patches covering 
the full pre-processed image and a smaller batch size of 2. With DSC values of 81.64, 
84.00, and 89.98 for the ET, TC, and WT regions, respectively, the DeepSeg A model yields 
good results, especially when compared to the inter-rater agreement range for manual MRI 
segmentation of GBM [24, 25]. By using a region-based version of DeepSeg with an 
input patch size of 128 x 128 x 128 voxels, a batch size of 5, the post-processing 
stage, and on-the-fly data augmentation, the DeepSeg B model achieved better results, 
with DSC values of 82.50, 84.73, and 90.05 for the ET, TC, and WT regions, respectively. 

Additionally, we used two different configurations of the BraTS 2020 winning approach 
nnU-Net [16]. The first model, nnU-Net A, is a region-based version of the standard 
nnU-Net with a large batch size of 5 and more aggressive data augmentation as described 
in [16], trained using the batch Dice loss and including the postprocessing stage. The 
nnU-Net B model is very similar to the nnU-Net A model but applies a brightness 
augmentation probability of 0.5 for each input modality, compared with 0.3 for model A. 
The nnU-Net models rank second and third in our ranking (see Table 1), achieving average 
DSC and HD95 results of 87.78, 87.87 and 9.6013, 10.1363 for each model, correspondingly. 

For the RSNA-ASNR-MICCAI BraTS 2021 challenge, we selected the three top-performing 
models to build our final ensemble: DeepSeg B + nnU-Net A + nnU-Net B. 
Our final ensemble was implemented by first predicting the validation cases individually 
with each model configuration, followed by averaging the softmax outputs to obtain the 
final cross-validation predictions. After that, STAPLE [18] was applied to aggregate 
the segmentations produced by the individual methods using a probabilistic 
estimate of the true segmentation. Our ensemble method is ranked among the top 10 
teams for the BraTS 2021 segmentation challenge. 
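
The first stage of this procedure, averaging the softmax outputs of the cross-validation models, can be sketched as below. The function name and the toy probability maps are illustrative assumptions, and the subsequent STAPLE fusion across model configurations is not reproduced here.

```python
import numpy as np


def average_softmax(prob_maps):
    """Average per-model class probabilities and take the most likely class."""
    stacked = np.stack(prob_maps, axis=0)   # (models, classes, voxels...)
    mean_probs = stacked.mean(axis=0)
    return mean_probs, mean_probs.argmax(axis=0)


# Two hypothetical fold outputs for 2 classes and 3 voxels.
fold_a = np.array([[0.9, 0.4, 0.2],
                   [0.1, 0.6, 0.8]])
fold_b = np.array([[0.7, 0.8, 0.4],
                   [0.3, 0.2, 0.6]])

mean_probs, labels = average_softmax([fold_a, fold_b])
# The second voxel shows the effect: the folds disagree, and the average decides.
```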


Table 1. Results of our five-fold cross-validation models on BraTS 2021 validation cases. All 
reported values were computed by the online evaluation platform Synapse. The averages of the DSC 
and HD95 scores are computed and used for ranking our methods. 


Model                 | DSC                           | HD95                          | Rank 
                      | ET    | TC    | WT    | Avg   | ET    | TC    | WT   | Avg   | 
DeepSeg A             | 81.64 | 84.00 | 89.98 | 85.21 | 19.77 | 10.25 | 5.11 | 11.71 | 5 
DeepSeg B*            | 82.50 | 84.73 | 90.05 | 85.76 | 21.36 | 12.96 | 8.04 | 14.12 | 4 
nnU-Net A**           | 84.02 | 87.18 | 92.13 | 87.78 | 16.03 | 8.95  | 3.82 | 9.60  | 2 
nnU-Net B***          | 83.72 | 87.84 | 92.05 | 87.87 | 17.73 | 8.81  | 3.87 | 10.14 | 3 
Ensemble (*, **, ***) | 84.10 | 87.33 | 92.00 | 87.81 | 16.02 | 8.91  | 3.81 | 9.58  | 1 


Fig. 4. Sample qualitative validation set results of our ensemble model. The best, median, and 
worst cases are shown in the rows. Columns display the T2, T1Gd, and the overlay of our predicted 
segmentation on the T1Gd image. Panel label of the worst case: BraTS2021_Validation_01739, 
ET (0), TC (85.34), WT (95.72). Images were obtained by using the 3D Slicer software [17]. 


3.3 Qualitative Output 


Figure 4 shows the qualitative segmentation predictions on the BraTS 2021 validation 
dataset. These outcomes were generated by applying our ensemble model. The rows 
show the best, median, and worst segmentations based on their DSC scores, respectively. 
From this figure, it can be seen that our model produces segmentations of high overall 
quality. Although the worst case, BraTS2021_Validation_01739, has an ET score of zero, 
this is not surprising, since it is a side effect of the postprocessing strategy described 
in Sect. 2.4. Notably, the WT region was detected with good quality 
(DSC of 95.72), which could already be valuable for clinical use. 


3.4 BraTS Test Dataset 


Table 2 summarizes the final results of the ensemble method on the BraTS 2021 test 
dataset. Superior results were obtained for the DSC of ET, while all other DSC 
results were broadly consistent with the validation dataset. In contrast, a substantial 
discrepancy between the validation and test datasets is visible for the HD95. Although 
our results were not state-of-the-art for the BraTS 2021 challenge, the proposed method 
showed segmentation performance better than or equal to the manual inter-rater agreement 
for tumor segmentation [3]. The results confirm that our method can be used to guide 
clinical experts in the diagnosis of brain cancer, treatment planning, and follow-up 
procedures. 


Table 2. Results of our final ensemble models on the BraTS 2021 test dataset. All reported values 
were provided by the challenge organizers. 


           | DSC                   | HD95 
           | ET    | TC    | WT    | ET    | TC    | WT 
Mean       | 87.63 | 87.49 | 91.87 | 12.13 | 6.27  | 14.89 
StdDev     | 18.22 | 24.31 | 10.97 | 59.61 | 27.79 | 63.32 
Median     | 93.70 | 96.04 | 95.11 | 1.00  | 2.00  | 1.41 
25quantile | 85.77 | 91.33 | 91.09 | 1.00  | 1.00  | 1.00 
75quantile | 96.62 | 98.20 | 97.22 | 1.73  | 4.12  | 3.00 


4 Conclusion 


In this paper, we described our contribution to the segmentation task of the 
RSNA-ASNR-MICCAI BraTS 2021 challenge. We used an ensemble model of two 
encoder-decoder-based CNN networks, namely, DeepSeg [10] and nnU-Net [16]. Table 1 and 
Table 2 list the results of our methods on the validation set and test set, respectively. 
Remarkably, our method achieved DSC of 92.00, 87.33, and 84.10 as well as HD95 of 3.81, 
8.91, and 16.02 for the WT, TC, and ET regions on the validation dataset, respectively. 
For the testing dataset, our final ensemble yielded DSC of 87.63, 87.49, and 91.87 in 
addition to HD95 of 12.1343, 14.8915, and 6.2716 for ET, TC, and WT regions, 
correspondingly. These results ranked us among the top 10 methods for the BraTS 2021 
segmentation challenge. Furthermore, qualitative evaluation supports the numerical 
evaluation, showing a high-quality segmentation. Our clinical partner suggested that this 
approach can be applied for guiding brain tumor surgery. 